Abstract

We consider the problem of statistical learning for the intensity of a count- 
ing process with covariates. In this context, we introduce an empirical risk, 
and prove risk bounds for the corresponding empirical risk minimizers. Then, 
we give an oracle inequality for the popular algorithm of aggregation with ex- 
ponential weights. This provides a way of constructing estimators that are 
adaptive to the smoothness and to the structure of the intensity. We prove 
that these estimators are adaptive over anisotropic Besov balls. The probabilis- 
tic tools are maximal inequalities using the generic chaining mechanism, which 
was introduced by Talagrand (2005), together with Bernstein's inequality for 
the underlying martingales. 

Keywords. Counting processes. Statistical learning. Adaptive estimation. Em- 
pirical risk minimization. Aggregation with exponential weights. Generic chain- 
ing 

1 Introduction 

Over the last decade, statistical learning theory (initiated by Vapnik, see for instance
Vapnik (2000)) has seen a tremendous number of mathematical developments. By
mathematical developments, we mean risk bounds for learning algorithms, such as
empirical risk minimization, penalization or aggregation. However, in the vast
majority of papers, such bounds are derived in the context of regression, density or
classification. In the regression model, one observes independent copies of (X, Y),
where X is an input, or a covariate, and Y is a real output, or label. The aim is
then to infer on E(Y|X). The aim of this paper is to study the same learning
algorithms (such as empirical risk minimization) in a more sophisticated setting, where
the output is not a real number, but a stochastic process. Namely, we focus on the 
situation where, roughly, the output is a counting process, which has an intensity 
that depends on the covariate X. The aim is then to infer on this intensity. This
framework contains many models that are of importance in practical situations, such
as in medicine, actuarial science or econometrics, see Andersen et al. (1993).

In this paper, we give risk bounds for empirical risk minimization and aggregation
algorithms. In summary, we try to recover the kind of results one usually
has in more "standard" models (see below for references). Then, as an application
of these results, we construct estimators that adapt to the
smoothness and to the structure of the intensity (in the context of a single-index
model). Several papers work in a setting close to ours. Model selection was
first studied in Reynaud-Bouret (2003) for the non-conditional intensity of a Poisson
process, see also Reynaud-Bouret (2006), Birgé (2007), Baraud and Birgé (2009)
and Brunel and Comte (2005). Model selection for the same problem as the one
considered here has been studied in Comte et al. (2008).

The agenda of the paper is the following. In this Section, we describe the general 
setting and the corresponding estimation problem. Section 1.2 is devoted to a pre- 
sentation of the main examples embedded in this setting. The main objects (such as 
the empirical risk) and the basic deviation inequalities are described in Section 2. In 
Section 3, we give risk bounds for the empirical risk minimization (ERM) algorithm. 
To that end, we provide useful uniform deviation inequalities using the generic chain- 
ing mechanism introduced in Talagrand (2005) (see Theorem 1 and Corollary 1), and 
we give a general risk bound for the ERM in Theorem 3 and its Corollary 2. In 
Section 4, we adapt a popular aggregation algorithm (aggregation with exponential 
weights) to our setup, and give an oracle inequality (see Theorem 4). In Section 5, 
we use the results from Sections 3 and 4 to construct estimators that adapt to the 
smoothness and to the structure of the intensity. We compute the convergence rates 
of the estimators, that are minimax optimal over anisotropic Besov balls. Section 6 
contains the proofs. Some useful results and tools are recalled in the Appendices. 

1.1 The model 

Let $(\Omega, \mathcal F, P)$ be a probability space and $(\mathcal F_t)_{t \geq 0}$ a filtration satisfying the usual
conditions, see Jacod and Shiryaev (1987). Let $N$ be a marked counting process with
compensator $\Lambda$ with respect to $(\mathcal F_t)_{t \geq 0}$, so that $M = N - \Lambda$ is an $(\mathcal F_t)_{t \geq 0}$-martingale.
We assume that $N$ is a marked point process satisfying the Aalen multiplicative
intensity model. This means that $\Lambda$ writes

$$\Lambda(t) = \int_0^t \alpha_0(s, X) Y(s) \, ds \tag{1}$$

for all $t \geq 0$, where:

• $\alpha_0$ is an unknown deterministic and nonnegative function called intensity;

• $X \in \mathbb R^d$ is an $\mathcal F_0$-measurable random vector called covariates or marks;

• $Y$ is a predictable random process in $[0, 1]$.

With differential notations, this model can be written as

$$dN(t) = \alpha_0(t, X) Y(t) \, dt + dM(t) \tag{2}$$

for all $t \geq 0$ with the same notations as before, and taking $N(0) = 0$. Now, assume
that we observe n i.i.d. copies

$$D_n = \{(X_i, N^i(t), Y^i(t)) : t \in [0, 1], \ 1 \leq i \leq n\} \tag{3}$$

of $\{(X, N(t), Y(t)) : t \in [0, 1]\}$. This means that we can write

$$dN^i(t) = \alpha_0(t, X_i) Y^i(t) \, dt + dM^i(t)$$

for any $i = 1, \ldots, n$, where the $M^i$ are independent $(\mathcal F_t)_{t \geq 0}$-martingales. In this setting,
the random variable $N^i(t)$ is the number of observed failures during the time interval
$[0, t]$ of the individual $i$.

The aim of the paper is to recover the intensity $\alpha_0$ on $[0, 1]$ based on the observation
of the sample $D_n$. This general setting includes several specific problems where
the estimation of $\alpha_0$ is of importance for practical applications, see Section 1.2. In all
that follows, we assume that the support of $P_X$ is compact, but in order to simplify
the presentation, we shall assume the following.

Assumption 1. The support of $P_X$ is $[0, 1]^d$, and

$$\|\alpha_0\|_\infty := \sup_{(t, x) \in [0, 1]^{d+1}} |\alpha_0(t, x)| \tag{4}$$

is finite.

These assumptions on the model are very mild, except for the i.i.d. assumption
on the sample, meaning that the individuals i are independent. Let us give several
examples of interest that fit in this general setting.

1.2 Examples 

1.2.1 Regression model for right-censored data 

Let $T$ be a nonnegative random variable (r.v.) and $X$ a vector of covariates in $\mathbb R^d$.
In this model, $T$ is not directly observable: what we observe instead is

$$T^C := \min(T, C) \quad \text{and} \quad \delta := \mathbf 1(T \leq C), \tag{5}$$

where $C$ is a nonnegative random variable called censoring. This setting, where the
data is right-censored, is of prime importance in applications, especially in medicine,
biology and econometrics. In these cases, the r.v. $T$ can represent the lifetime of an
individual, the time from the onset of a disease to the healing, the duration of
unemployment, etc. The r.v. $C$ is often the time of last contact or the duration of
follow-up. In this model we make the following mild assumption:

$T$ and $C$ are independent conditionally on $X$, (6)

which allows the censoring to depend on the covariates, see Heuchenne and Van Keilegom
(2007). This assumption is weaker than the more common assumption that $T$ and $C$
are independent, see in particular Stute (1996).
In this case, the counting process writes

$$N^i(t) = \mathbf 1(T_i^C \leq t, \ \delta_i = 1) \quad \text{and} \quad Y^i(t) = \mathbf 1(T_i^C \geq t),$$

see e.g. Andersen et al. (1993). In this setting, the intensity $\alpha_0$ is the conditional
hazard rate of $T$ given $X = x$, which is defined for all $t \geq 0$ and $x \in \mathbb R^d$ by

$$\alpha_0(t, x) = \frac{f_{T|X}(t, x)}{1 - F_{T|X}(t, x)},$$

where $f_{T|X}$ and $F_{T|X}$ are the conditional probability density function (p.d.f.) and
the conditional distribution function (d.f.) of $T$ given $X$ respectively. The available
data in this setting becomes

$$D_n := [(X_i, T_i^C, \delta_i) : 1 \leq i \leq n],$$

where $(X_i, T_i^C, \delta_i)$ are i.i.d. copies of $(X, T^C, \delta)$, where we assumed (6), namely $T_i$
and $C_i$ are independent conditionally on $X_i$ for $1 \leq i \leq n$.
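
As an illustration of this observation scheme, here is a minimal Python sketch that simulates right-censored data and evaluates the two processes of the model. The exponential lifetime with constant-in-time hazard `hazard(x)`, the uniform censoring law and all function names are assumptions made for the example only; they are not part of the model described above.

```python
import random


def simulate_censored_sample(n, hazard, censor_max=1.5, seed=0):
    """Draw n triples (X_i, T_i^C, delta_i).

    Assumptions of this sketch: X is uniform on [0, 1], T given X = x is
    exponential with rate hazard(x) (so the conditional hazard is constant
    in time), and C is uniform on [0, censor_max], independent of T given X.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.random()
        t = rng.expovariate(hazard(x))      # unobserved lifetime T
        c = rng.uniform(0.0, censor_max)    # censoring time C
        sample.append((x, min(t, c), 1 if t <= c else 0))
    return sample


def counting_process(t, t_obs, delta):
    """N(t) = 1(T^C <= t, delta = 1): one jump, only if the failure is observed."""
    return 1 if (t_obs <= t and delta == 1) else 0


def at_risk(t, t_obs):
    """Y(t) = 1(T^C >= t): the individual is still under observation at time t."""
    return 1 if t_obs >= t else 0
```

A call such as `simulate_censored_sample(100, lambda x: 1.0 + x)` returns a list of observed triples, from which `counting_process` and `at_risk` can be evaluated at any time point.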

The nonparametric estimation of the hazard rate was initiated by Beran (1981);
Stute (1986), Dabrowska (1987), McKeague and Utikal (1990) and Li and Doss (1995)
extended his results. Many authors have considered semiparametric estimation of the
hazard rate, beginning with Cox (1972), see Andersen et al. (1993) for a review of
the enormous literature on semiparametric models. We refer to Huang (1999) and
Linton et al. (2003) for some recent developments. As far as we know, adaptive
nonparametric estimation for censored data in presence of covariates has only been
considered in Brunel et al. (2007), who constructed an optimal adaptive estimator of
the conditional density.

1.2.2 Cox processes 

Let $\eta^i$, $1 \leq i \leq n$, be $n$ independent Cox processes on $\mathbb R_+$, with mean-measure $A^i$
given by:

$$A^i(t) = \int_0^t \alpha_0(s, X_i) \, ds,$$

where $X_i$ is a vector of covariates in $\mathbb R^d$. This is a particular case of longitudinal data,
see e.g. Example VII.2.15 in Andersen et al. (1993). The nonparametric estimation
of the intensity of Poisson processes without covariates has been considered in several
papers. We refer to Reynaud-Bouret (2003) for the adaptive estimation (using model
selection) for the intensity of nonhomogeneous Poisson processes in a general space.
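
To make this example concrete, the following Python sketch simulates one path of such a process conditionally on the covariate, in which case the process is a nonhomogeneous Poisson process. It uses the classical thinning method; the function name, the requirement of a known upper bound `bound` on the intensity and the restriction to a finite horizon are assumptions of the sketch.

```python
import random


def simulate_poisson_path(intensity, x, bound, horizon=1.0, seed=0):
    """Jump times on [0, horizon] of a Poisson process with intensity t -> intensity(t, x).

    Thinning: candidate points come from a homogeneous Poisson process of
    rate `bound`, and a candidate at time t is kept with probability
    intensity(t, x) / bound. Requires 0 <= intensity(t, x) <= bound.
    """
    rng = random.Random(seed)
    jumps = []
    t = 0.0
    while True:
        t += rng.expovariate(bound)         # next candidate point
        if t > horizon:
            return jumps
        if rng.random() * bound < intensity(t, x):
            jumps.append(t)                 # candidate accepted
```

The counting process of the model is then the number of accepted jump times up to time t, with Y identically equal to 1 on the observation window.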

1.2.3 Regression model for transition intensities of Markov processes 

Consider an $n$-sample of nonhomogeneous continuous-time Markov processes $P^1, \ldots, P^n$
with finite state space $\{1, \ldots, k\}$ and denote by $\lambda_{jl}$ the transition intensity from state
$j$ to state $l$. For an individual $i$ with covariate $X_i$, the r.v. $N^i_{jl}(t)$ counts the number
of observed direct transitions from $j$ to $l$ before time $t$ (we allow the possibility of
right-censoring for example). Conditionally on the initial state, the counting process
$N^i_{jl}$ verifies the following Aalen multiplicative intensity model:

$$N^i_{jl}(t) = \int_0^t \lambda_{jl}(X_i, z) Y^i_j(z) \, dz + M^i(t) \quad \text{for all } t \geq 0,$$

where $Y^i_j(t) = \mathbf 1(P^i(t-) = j)$ for all $t \geq 0$, see Andersen et al. (1993) or Jacobsen
(1982). This setting is discussed in Andersen et al. (1993), see Example VII.11 on
mortality and nephropathy for insulin dependent diabetics.

We finally cite three papers, where the estimation of the intensity of counting processes
was considered, gathering as a consequence all the previous examples, but in
none of them the presence of covariates was considered. Ramlau-Hansen (1983) proposed
a kernel-type estimator, Grégoire (1993) studied least squares cross-validation.
More recently, Reynaud-Bouret (2006) considered adaptive estimation by projection
and Baraud and Birgé (2009) considered the adaptive estimation of the intensity of
a random measure by histogram-type estimators.

1.3 Some notations 

From now on, we will denote by $L$ an absolute constant that can vary from place to
place (even in the same line), and by $c$ a constant that depends on some parameters,
that we shall indicate in subscripts. In all of what follows, $D_n$ is an i.i.d. sample
satisfying model (2), and we take $(X, (Y_t), (N_t))$ independent of $D_n$ that also satisfies
model (2). Note that we will use both notations $(Z_t)_{t \geq 0}$ and $(Z(t))_{t \geq 0}$ for a stochastic
process $Z$. We denote by $P^n[\cdot]$ the joint law of $D_n$ and $P[\cdot]$ the law of $(X, (Y_t), (N_t))$,
and by $E^n[\cdot]$ and $E[\cdot]$ the corresponding expectations.

2 Main constructions and objects 
2.1 An empirical risk 

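Let $x \in \mathbb R^d$ and $(y_t)$, $(n_t)$ be functions on $[0, 1]$ with bounded variations, and
let $\alpha : [0, 1]^{d+1} \to \mathbb R_+$ be a bounded and predictable function (that can possibly
depend on $D_n$). We define the loss function

$$\ell_\alpha(x, (y_t), (n_t)) := \int_0^1 \alpha(t, x)^2 y(t) \, dt - 2 \int_0^1 \alpha(t, x) \, dn(t).$$

This quantity measures the goodness-of-fit of $\alpha$ to the data from $(x, (y_t), (n_t))$. It has been
used in Comte et al. (2008) to perform model selection. We define the least-squares
type empirical risk of $\alpha$ as:

$$P_n(\ell_\alpha) := \frac{1}{n} \sum_{i=1}^n \ell_\alpha(X_i, (Y^i_t), (N^i_t)). \tag{7}$$

It is the empirical version of the theoretical risk

$$P(\ell_\alpha) := E[\ell_\alpha(X, (Y_t), (N_t)) \,|\, D_n] = E\Big[ \int_0^1 \alpha(t, X)^2 Y(t) \, dt - 2 \int_0^1 \alpha(t, X) \, dN(t) \,\Big|\, D_n \Big].$$

This risk is natural in this model. Indeed, if $\alpha$ is independent of $D_n$, we have in view
of (2), since $M(t)$ is centered:

$$P(\ell_\alpha) = E \int_0^1 \big( \alpha(t, X)^2 - 2 \alpha(t, X) \alpha_0(t, X) \big) Y(t) \, dt - 2 E \int_0^1 \alpha(t, X) \, dM(t)$$

$$\phantom{P(\ell_\alpha)} = \|\alpha\|^2 - 2 \langle \alpha, \alpha_0 \rangle \tag{8}$$

$$\phantom{P(\ell_\alpha)} = \|\alpha - \alpha_0\|^2 - \|\alpha_0\|^2, \tag{9}$$

where $\|\alpha\|^2 := \langle \alpha, \alpha \rangle$ and

$$\langle \alpha, \alpha_0 \rangle := \int_{[0,1]^d} \int_0^1 \alpha(t, x) \alpha_0(t, x) E[Y(t) \,|\, X = x] \, dt \, P_X(dx). \tag{10}$$

This is an inner product with respect to the bounded measure (its total mass is at most 1)

$$d\mu(t, x) := E[Y(t) \,|\, X = x] \, dt \, P_X(dx). \tag{11}$$

We will denote by $L^2(\mu)$ the corresponding Hilbert space, and define $L^\infty(\mu)$ as the
subset of $L^2(\mu)$ consisting of functions $\alpha$ such that $\|\alpha\|_\infty < +\infty$.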



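In view of (8), $P(\ell_\alpha) - P(\ell_{\alpha_0})$ (called excess risk) is equal to $\|\alpha - \alpha_0\|^2$. As a
consequence, $\alpha_0$ minimizes $\alpha \mapsto P(\ell_\alpha)$, so a natural way to recover $\alpha_0$ is to take a
minimizer of $\alpha \mapsto P_n(\ell_\alpha)$. This is the basic idea of empirical risk minimization, for
which we propose risk bounds in Section 3 below. Let us define the empirical norm

$$\|\alpha\|_n^2 := \frac{1}{n} \sum_{i=1}^n \int_0^1 \alpha(t, X_i)^2 Y^i(t) \, dt, \tag{12}$$

so that we have $E^n \|\alpha\|_n^2 = \|\alpha\|^2$ if $\alpha$ is deterministic. Note that $\|\alpha\|_n \leq \|\alpha\|_\infty$ and
$\|\alpha\| \leq \|\alpha\|_\infty$. An important fact is that (2) entails

$$P_n(\ell_\alpha) - P_n(\ell_{\alpha_0}) = \|\alpha - \alpha_0\|_n^2 - \frac{2}{\sqrt n} Z_n(\alpha - \alpha_0), \tag{13}$$

where $Z_n(\cdot)$ is given by

$$Z_n(\alpha) := \frac{1}{\sqrt n} \sum_{i=1}^n \int_0^1 \alpha(t, X_i) \, dM^i(t), \tag{14}$$

where $M^i$ are the independent copies of the martingale innovation from (2). The
decomposition (13) will be of importance in the analysis of the problem.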
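
The least-squares empirical risk can be evaluated numerically once the data are given. The sketch below does so for right-censored observations $(X_i, T_i^C, \delta_i)$ as in Section 1.2.1, where the at-risk process is the indicator of $\{T_i^C \geq t\}$ and the counting process has at most one jump. The midpoint quadrature, the function name and the data layout are assumptions of the example.

```python
def empirical_risk(alpha, sample, grid_size=200):
    """Least-squares empirical risk on [0, 1] for right-censored data.

    sample: list of (x, t_obs, delta), t_obs = min(T, C), delta = 1 if uncensored.
    For each individual the loss is
        integral over [0, min(t_obs, 1)] of alpha(t, x)^2 dt
        - 2 * alpha(t_obs, x) * 1(delta = 1, t_obs <= 1).
    """
    total = 0.0
    for x, t_obs, delta in sample:
        upper = min(t_obs, 1.0)             # the at-risk indicator vanishes after t_obs
        h = upper / grid_size
        # midpoint rule for the integral of alpha(t, x)^2 over [0, upper]
        quad = sum(alpha((k + 0.5) * h, x) ** 2 for k in range(grid_size)) * h
        # the counting process jumps once, at t_obs, when the failure is observed
        jump = alpha(t_obs, x) if (delta == 1 and t_obs <= 1.0) else 0.0
        total += quad - 2.0 * jump
    return total / len(sample)
```

For a constant candidate `alpha`, the risk is a quadratic function of the constant, minimized at the ratio of the number of observed failures to the total time at risk.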

Remark 1 (Regression model for right-censored data). In the problem of censored
survival times with covariates, see Section 1.2.1, the semi-norm of estimation becomes

$$\|\alpha\|^2 = \int_{[0,1]^d} \int_0^1 \alpha(t, x)^2 H_{T^C|X}(t, x) \, dt \, P_X(dx),$$

where $H_{T^C|X}(t, x) := P[T^C \geq t \,|\, X = x]$, and where by (5) and (6):

$$H_{T^C|X}(t, x) = P[T \geq t \,|\, X = x] \, P[C \geq t \,|\, X = x].$$

This weighting of the norm is natural and, somehow, unavoidable in models with
censored data. The same normalization can be found, for instance, in the Dvoretzky-
Kiefer-Wolfowitz concentration inequality for the Kaplan-Meier estimator (without
covariates), see Theorem 1 in Bitouzé et al. (1999).

2.2 Deviation inequalities

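Let us denote by $\langle Z \rangle$ the predictable variation of a random process $Z$. Note that we
have, using Assumption 1:

$$\langle Z_n(\alpha) \rangle = \frac{1}{n} \sum_{i=1}^n \int_0^1 \alpha(t, X_i)^2 \alpha_0(t, X_i) Y^i(t) \, dt \leq \|\alpha_0\|_\infty \|\alpha\|_n^2. \tag{15}$$

A useful result is then the following. First, introduce, for $\delta > 0$,

$$\psi_{n,\delta}(h) := \log E^n\big[ e^{h Z_n(\alpha)} \mathbf 1_{\langle Z_n(\alpha) \rangle \leq \delta^2} \big]$$

and the Cramér transform $\psi^*_{n,\delta}(z) := \sup_{h > 0} (h z - \psi_{n,\delta}(h))$.

Proposition 1. For any bounded $\alpha$ and any $z, \delta > 0$, the following inequality holds:

$$\psi^*_{n,\delta}(z) \geq \frac{n \delta^2}{\|\alpha\|_\infty^2} \, g\Big( \frac{z \|\alpha\|_\infty}{\sqrt n \, \delta^2} \Big), \tag{16}$$

where $g(x) := (1 + x) \log(1 + x) - x$.

This result and the deviation inequalities stated below are related to standard results
concerning martingales with jumps, see Liptser and Shiryayev (1989), van de Geer
(1995) or Reynaud-Bouret (2006), among others. For the sake of completeness we
give a proof of Proposition 1 in Section 6. From the lower bound (16), we can derive
several deviation inequalities. Using the Cramér-Chernoff bound
$P^n[Z_n(\alpha) > z, \langle Z_n(\alpha) \rangle \leq \delta^2] \leq \exp(-\psi^*_{n,\delta}(z))$, we obtain the following Bennett's inequality:

$$P^n[Z_n(\alpha) > z, \ \langle Z_n(\alpha) \rangle \leq \delta^2] \leq \exp\Big( - \frac{n \delta^2}{\|\alpha\|_\infty^2} \, g\Big( \frac{z \|\alpha\|_\infty}{\sqrt n \, \delta^2} \Big) \Big)$$

for any $z > 0$. As a consequence, since $g(x) \geq 3x^2 / (2(x + 3))$ for any $x \geq 0$, we obtain
the following Bernstein's inequality:

$$P^n[Z_n(\alpha) > z, \ \langle Z_n(\alpha) \rangle \leq \delta^2] \leq \exp\Big( - \frac{z^2}{2(\delta^2 + z \|\alpha\|_\infty / (3 \sqrt n))} \Big). \tag{17}$$

Another useful Bernstein's inequality can be derived using the following trick from Birgé and Massart
(1998): since $g(x) \geq g_2(x)$ for any $x \geq 0$ where $g_2(x) = x + 1 - \sqrt{1 + 2x}$, and since
$g_2^{-1}(y) = \sqrt{2y} + y$, we have

$$P^n\Big[ Z_n(\alpha) > \delta \sqrt{2x} + \frac{\|\alpha\|_\infty x}{\sqrt n}, \ \langle Z_n(\alpha) \rangle \leq \delta^2 \Big] \leq \exp(-x) \tag{18}$$

for any $x > 0$. Note that from (16), we can derive a uniform deviation inequality.
Consider a family $(Z_n(\alpha) : \alpha \in A)$, where $A$ is a set of bounded functions with finite
cardinality $N$. Since $\psi^{*-1}_{n,\delta}(z) \leq z \|\alpha\|_\infty / \sqrt n + \delta \sqrt{2z}$ (see above) we have, using Pisier's
argument (see Section 2 in Massart (2007)), that

$$P^n\Big[ Z_n(\alpha) > \delta \sqrt{2(\ln N + x)} + \frac{\|\alpha\|_\infty}{\sqrt n} (\ln N + x), \ \langle Z_n(\alpha) \rangle \leq \delta^2 \text{ for some } \alpha \in A \Big] \leq \exp(-x). \tag{19}$$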
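
The two elementary facts behind the Bernstein inequalities above, namely that $g$ dominates both $3x^2/(2(x+3))$ and $g_2(x) = x + 1 - \sqrt{1+2x}$, and that $g_2^{-1}(y) = \sqrt{2y} + y$, can be checked numerically. The Python sketch below does this on a grid and also evaluates the deviation threshold $\delta\sqrt{2x} + \|\alpha\|_\infty x/\sqrt n$; the function names are illustrative only.

```python
import math


def g(x):
    """Bennett function g(x) = (1 + x) log(1 + x) - x."""
    return (1.0 + x) * math.log1p(x) - x


def g2(x):
    """Lower bound g2(x) = x + 1 - sqrt(1 + 2x) used in the Birge-Massart trick."""
    return x + 1.0 - math.sqrt(1.0 + 2.0 * x)


def g2_inverse(y):
    """Inverse of g2 on the nonnegative half-line: sqrt(2y) + y."""
    return math.sqrt(2.0 * y) + y


def bernstein_lower(x):
    """Lower bound 3 x^2 / (2 (x + 3)) leading to the first Bernstein inequality."""
    return 3.0 * x * x / (2.0 * (x + 3.0))


def bernstein_threshold(delta, sup_norm, n, x):
    """Threshold delta * sqrt(2x) + sup_norm * x / sqrt(n) of the second Bernstein inequality."""
    return delta * math.sqrt(2.0 * x) + sup_norm * x / math.sqrt(n)
```

The first term of the threshold is the subgaussian part, driven by the variance proxy, and the second is the subexponential part, driven by the sup-norm and vanishing as the sample size grows.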

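In view of the next Lemma, we can remove the event $\{\langle Z_n(\alpha) \rangle \leq \delta^2\}$ from the
previous inequalities. Indeed, a consequence of (18) is the following.

Lemma 1. If $\alpha$ is bounded, we have for any $x > 0$:

$$P^n\Big[ Z_n(\alpha) > c \|\alpha\| \sqrt x + (c + 1) \frac{\|\alpha\|_\infty x}{\sqrt n} \Big] \leq 2 \exp(-x),$$

where $c = c_{\|\alpha_0\|_\infty} := [\sqrt 2 (\sqrt 2 + 1) \|\alpha_0\|_\infty]^{1/2}$.

Proof. Since $E[(\int_0^1 \alpha(t, X)^2 Y(t) \, dt)^2] \leq \|\alpha^2\|^2$ and $\|\int_0^1 \alpha(t, X)^2 Y(t) \, dt\|_\infty \leq \|\alpha\|_\infty^2$,
Bernstein's inequality for the deviation of the sum of i.i.d. random variables gives

$$P^n\Big[ \|\alpha\|_n^2 - \|\alpha\|^2 > \|\alpha^2\| \sqrt{\frac{2x}{n}} + \frac{\|\alpha\|_\infty^2 x}{n} \Big] \leq \exp(-x). \tag{20}$$

Take $\delta_{n,x}^2 := \|\alpha\|^2 + \|\alpha^2\| \sqrt{2x/n} + \|\alpha\|_\infty^2 x / n$. We have $P^n[\|\alpha\|_n^2 > \delta_{n,x}^2] \leq \exp(-x)$
and, since $\|\alpha^2\| \leq \|\alpha\|_\infty \|\alpha\|$,

$$\delta_{n,x} \sqrt{2 \|\alpha_0\|_\infty x} \leq c \|\alpha\| \sqrt x + c \frac{\|\alpha\|_\infty x}{\sqrt n}.$$

Now, use (15) and (18) to obtain

$$P^n\Big[ Z_n(\alpha) > c \|\alpha\| \sqrt x + (c + 1) \frac{\|\alpha\|_\infty x}{\sqrt n}, \ \|\alpha\|_n^2 \leq \delta_{n,x}^2 \Big] \leq \exp(-x)$$

for any $x > 0$. This concludes the proof of the Lemma, by a decomposition over
$\{\|\alpha\|_n > \delta_{n,x}\}$ and $\{\|\alpha\|_n \leq \delta_{n,x}\}$. □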

These deviation inequalities are the starting point of the proof of risk bounds 
for the algorithm of empirical risk minimization (ERM). Such a bound is given in 
Section 3 below, see Theorem 3. It requires a generalization of the bound (19) to a 
general set A, which is given in Section 3.2. 



3 Empirical risk minimization 

The very basic idea of empirical risk minimization (ERM) is the following. Since
$\alpha_0$ minimizes the risk $\alpha \mapsto P(\ell_\alpha)$, a natural estimate of $\alpha_0$ is a minimizer of the
empirical risk $\alpha \mapsto P_n(\ell_\alpha)$ over some set of functions $A$, usually called a sieve. There
is hope that such an empirical minimizer is close to $\alpha_0$, at least if $\alpha_0$ is not far from $A$
and if $(P - P_n)(\ell_\alpha)$ is small (more details below). Also known as M-estimation, this
algorithm has been studied extensively, see for instance Birgé and Massart (1998),
Vapnik (2000), van de Geer (2000), Massart (2007), Bartlett and Mendelson (2006),
among many others.

If no minimizer of the empirical risk exists, we can simply consider, as is
usually done in the literature, a $\rho$-minimizer according to the following definition.

Definition 1 ($\rho$-ERM). Let $\rho > 0$ be fixed. A $\rho$-Empirical Risk Minimizer ($\rho$-ERM)
is an estimator $\bar\alpha_n \in A$ satisfying

$$P_n(\ell_{\bar\alpha_n}) \leq \rho + \inf_{\alpha \in A} P_n(\ell_\alpha),$$

where $P_n(\ell_\alpha)$ is the empirical risk (7).

For what follows, one can take $\rho = 1/n$, since typically, the risk of $\bar\alpha_n$ is larger than
that. To prove a risk bound for the ERM, one usually needs a deviation inequality
for

$$\zeta_n(A) := \sup_{\alpha \in A} (P - P_n)(\ell_\alpha).$$

However, when $A$ is not countable, $\zeta_n(A)$ may not be measurable. This is not a problem
since we can always consider the outer expectation in the statement of the deviation
(see van der Vaart and Wellner (1996)), or simply assume the following.

Assumption 2. There is a countable subset $A'$ of $A$ such that almost surely,

$$\sup_{\alpha \in A'} P_n(\ell_\alpha) = \sup_{\alpha \in A} P_n(\ell_\alpha).$$

Moreover, assume that there is $b > 0$ such that $\|\alpha\|_\infty \leq b$ for every $\alpha \in A$.

The map $\alpha \mapsto P_n(\ell_\alpha)$ is continuous over $C([0, 1]^{d+1})$ endowed with the norm $\|\cdot\|_\infty$.
So, given that $A \subset C([0, 1]^{d+1})$, the first part of Assumption 2 is met. Note that this
embedding holds in the examples considered in Section 5. The second part is rather
unpleasant, but mandatory if no extra assumption is made on $A$, and since an $L^2$
metric is considered for the estimation of $\alpha_0$.

From now on, we take $\alpha_*$ such that $P(\ell_{\alpha_*}) = \inf_{\alpha \in A} P(\ell_\alpha)$ (if no such $\alpha_*$ exists, we
can simply consider $\alpha_*$ such that $P(\ell_{\alpha_*}) \leq \inf_{\alpha \in A} P(\ell_\alpha) + \rho$). Note that $\alpha_*$ may not
be unique at this point; we just pick one of the minimizers. The function $\alpha_*$ is usually
called the target function, or the oracle in learning theory, see Cucker and Smale
(2002) for instance.
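
For some simple sieves the empirical risk minimizer is available in closed form. As a hedged illustration (not a construction taken from this paper), the Python sketch below treats right-censored data $(X_i, T_i^C, \delta_i)$ and the sieve of functions that are piecewise constant in time on a regular partition of $[0, 1]$ and do not depend on the covariate. On each bin the empirical risk is a quadratic in the bin value, so the minimizer is the number of observed failures in the bin divided by the total time at risk in the bin. The function name and data layout are assumptions, and the boundedness constraint of Assumption 2 is ignored.

```python
def histogram_erm(sample, n_bins):
    """Exact empirical risk minimizer over time-histograms on [0, 1].

    sample: list of (x, t_obs, delta); the covariate x is ignored by this sieve.
    With exposure_k = total time at risk in bin k and events_k = number of
    observed failures in bin k, n times the empirical risk equals
        sum over k of (a_k^2 * exposure_k - 2 * a_k * events_k),
    which is minimized at a_k = events_k / exposure_k.
    """
    width = 1.0 / n_bins
    exposure = [0.0] * n_bins
    events = [0] * n_bins
    for _x, t_obs, delta in sample:
        for k in range(n_bins):
            lo, hi = k * width, (k + 1) * width
            exposure[k] += max(0.0, min(t_obs, hi) - lo)
        if delta == 1 and t_obs <= 1.0:
            events[min(int(t_obs / width), n_bins - 1)] += 1
    # bins with no exposure carry no information; the value 0 is an arbitrary choice
    return [e / w if w > 0 else 0.0 for e, w in zip(events, exposure)]
```

Richer sieves, such as those depending on the covariate, generally require a numerical least-squares solve instead of this bin-by-bin formula.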



3.1 Peeling 

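A common way to prove a risk bound for the ERM uses the idea of localization or
peeling (see for instance Massart (2007), Lemma 4.23 and van de Geer (2000, 2007),
among others). The idea presented here is very close to these references. First, do a
shift: take $\varepsilon > 0$, and use the fact that $\bar\alpha_n$ is a $\rho$-ERM to obtain

$$\begin{aligned} P(\ell_{\bar\alpha_n}) - P(\ell_{\alpha_*}) &\leq (1 + \varepsilon) \rho + P(\ell_{\bar\alpha_n}) - P(\ell_{\alpha_*}) - (1 + \varepsilon)(P_n(\ell_{\bar\alpha_n}) - P_n(\ell_{\alpha_*})) \\ &\leq (1 + \varepsilon) \rho + \zeta_{n,\varepsilon}(A), \end{aligned}$$

where

$$\zeta_{n,\varepsilon}(A) := \sup_{\alpha \in A} \big\{ (1 + \varepsilon)(P - P_n)(\ell_\alpha - \ell_{\alpha_*}) - \varepsilon P(\ell_\alpha - \ell_{\alpha_*}) \big\}.$$

Then, for some constants $\delta > 0$ and $q > 1$, decompose the supremum over $A$ into
suprema over annuli $A_j(\delta)$, where $A(\delta) = \{\alpha \in A : P(\ell_\alpha) - P(\ell_{\alpha_*}) \leq \delta\}$, and for
$j \geq 0$, $A_j(\delta) = \{\alpha \in A : q^j \delta < P(\ell_\alpha) - P(\ell_{\alpha_*}) \leq q^{j+1} \delta\}$. Assume for the moment
that there exists an increasing function $\psi : \mathbb R_+ \to \mathbb R_+$ and $\delta_{\min} > 0$ such that for any
$x > 0$ and $\delta > \delta_{\min}$, we have with a probability larger than $1 - L e^{-x}$:

$$\sup_{\alpha \in A(\delta)} (P - P_n)(\ell_\alpha - \ell_{\alpha_*}) \leq \frac{\psi(\delta)}{\sqrt n} (1 + \sqrt x \vee x). \tag{21}$$

Such an inequality will be proved in Section 3.2 below. It entails that, with a probability
larger than $1 - L e^{-x}$:

$$\zeta_{n,\varepsilon}(A) \leq (1 + \varepsilon) \frac{\psi(\delta)}{\sqrt n} (1 + \sqrt x \vee x) + \sup_{j \geq 0} \Big\{ (1 + \varepsilon) \frac{\psi(q^{j+1} \delta)}{\sqrt n} (1 + \sqrt x \vee x) - \varepsilon q^j \delta \Big\}. \tag{22}$$

Assume further that $\psi$ is continuous, increasing, such that $\delta \mapsto \psi(\delta)/\delta$ is decreasing
and $\psi^{-1}$ is strictly convex. We can define the convex conjugate of $\psi^{-1}$ as

$$\psi^{-1*}(\delta) := \sup_{x > 0} \{ x \delta - \psi^{-1}(x) \}. \tag{23}$$

The following Lemma comes in handy to choose a parameter $\delta$ that kills the second
term in the right hand side of (22).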

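Lemma 2. Let $\psi : \mathbb R_+ \to \mathbb R_+$ be a continuous and increasing function and assume
that $\psi^{-1}$ is strictly convex. If $\delta := \psi^{-1*}(2x/y)$, we have

$$x \psi(\delta) \leq y \delta$$

for any $x, y > 0$.
Proof. Simply write

$$x \psi(\delta) = \frac{y}{2} \cdot \frac{2x}{y} \, \psi(\delta) \leq \frac{y}{2} \Big( \psi^{-1*}\Big( \frac{2x}{y} \Big) + \psi^{-1}(\psi(\delta)) \Big) = \frac{y}{2} (\delta + \delta) = y \delta,$$

where the trick is to use the fact that $uv \leq \psi^{-1*}(u) + \psi^{-1}(v)$ for any $u, v > 0$. □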
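
The role of the convex conjugate can be checked numerically in the power case $\psi(\delta) = c\delta^a$ with $0 < a < 1$, for which the conjugate of $\psi^{-1}$ has the closed form $(1-a)a^{a/(1-a)}(cu)^{1/(1-a)}$. The Python sketch below compares this closed form with a brute-force supremum over a grid and verifies the inequality of Lemma 2 for the choice $\delta = \psi^{-1*}(2x/y)$. The power form of $\psi$, the grid parameters and the function names are assumptions of the example.

```python
def psi(delta, c, a):
    """Power function psi(delta) = c * delta**a, with c > 0 and 0 < a < 1."""
    return c * delta ** a


def psi_inv_conjugate(u, c, a):
    """Closed form of sup over v > 0 of u*v - psi^{-1}(v), where psi^{-1}(v) = (v/c)**(1/a)."""
    return (1.0 - a) * a ** (a / (1.0 - a)) * (c * u) ** (1.0 / (1.0 - a))


def conjugate_by_grid(u, c, a, v_max=50.0, steps=20000):
    """Brute-force value of the same supremum over a regular grid of [0, v_max]."""
    best = 0.0
    for k in range(steps + 1):
        v = v_max * k / steps
        best = max(best, u * v - (v / c) ** (1.0 / a))
    return best
```

For $a = 1/2$ the inequality of the Lemma holds with equality, which shows that the factor 2 in the choice of $\delta$ cannot be reduced in general.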

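Using Lemma 2 and the fact that $\psi(q^{j+1} \delta) \leq q^{j+1} \psi(\delta)$, we obtain that for the
choice

$$\delta = \delta_{n,\varepsilon}(x) := \psi^{-1*}\Big( \frac{2 (1 + \varepsilon) q (1 + \sqrt x \vee x)}{\varepsilon \sqrt n} \Big) \vee \delta_{\min},$$

we have, with a probability larger than $1 - L e^{-x}$:

$$\zeta_{n,\varepsilon}(A) \leq \varepsilon \, \delta_{n,\varepsilon}(x).$$

We have proved the following result.

Proposition 2 (Peeling). Assume that (21) holds for any $\delta > \delta_{\min}$, where $\psi : \mathbb R_+ \to \mathbb R_+$
is a continuous and increasing function such that $\psi^{-1}$ is strictly convex and
$\delta \mapsto \psi(\delta)/\delta$ is decreasing. If $\bar\alpha_n$ is a $\rho$-ERM according to Definition 1, we have for
any $x > 0$:

$$P(\ell_{\bar\alpha_n}) \leq P(\ell_{\alpha_*}) + (1 + \varepsilon) \rho + \varepsilon \, \delta_{n,\varepsilon}(x)$$

with probability larger than $1 - L e^{-x}$, where

$$\delta_{n,\varepsilon}(x) := \psi^{-1*}\Big( \frac{2 (1 + \varepsilon) q (1 + \sqrt x \vee x)}{\varepsilon \sqrt n} \Big) \vee \delta_{\min}.$$

In the next section, we prove Inequality (21) using the generic chaining mechanism,
under an assumption on the complexity of $A$.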

3.2 Generic chaining 

The generic chaining technique, which was introduced in Talagrand (2005), is, in our
setting, a nice way to prove (21). It is based on the $\gamma_\nu(A, d)$ functional (see below)
which is an alternative to Dudley's entropy integral (see Dudley (1978) for instance).
The idea is to decompose $A$ using an approximating sequence of partitions, instead
of nets with decreasing radius, as is done in the standard chaining method. Let
us briefly recall some necessary notions that can be found in detail in Talagrand
(2005), Chapter 1.

Let $(A, d)$ be a metric space ($d$ can be a semi-distance). Denote by $\Delta(A, d) :=
\sup_{a, b \in A} d(a, b)$ the diameter of $A$. An admissible sequence of $A$ is an increasing
sequence $(\mathcal A_j)_{j \geq 0}$ of partitions of $A$ (every set of $\mathcal A_{j+1}$ is included in a set of $\mathcal A_j$) such
that $|\mathcal A_j| \leq 2^{2^j}$ and $|\mathcal A_0| = 1$. If $a \in A$, we denote by $A_j(a)$ the unique element of $\mathcal A_j$
that contains $a$. For $\nu > 0$, define the function

$$\gamma_\nu(A, d) := \inf \sup_{a \in A} \sum_{j \geq 0} 2^{j/\nu} \Delta(A_j(a), d), \tag{24}$$

where the infimum is taken among all admissible sequences of $A$. This quantity is an
alternative to (and an improvement on, see Talagrand (2005), in particular Theorem 3.3.2)
Dudley's entropy integral (see for instance van der Vaart and Wellner (1996)).
Indeed, we have:

$$\gamma_\nu(A, d) \leq L \int_0^{\Delta(A, d)} (\log N(A, \varepsilon, d))^{1/\nu} \, d\varepsilon, \tag{25}$$

where $N(A, \varepsilon, d)$ is the covering number of $A$, namely the smallest integer $N$ such
that there is $B \subset A$ satisfying $|B| \leq N$ and $d(a, B) \leq \varepsilon$ for any $a \in A$.

Introduce $d_2(a, b) := \|a - b\|$ where $\|\cdot\|$ is the semi-norm given by (10) and
$d_\infty(a, b) := \|a - b\|_\infty$, where $\|\cdot\|_\infty$ is the uniform norm (4). Using the generic
chaining argument, we obtain the following deviation inequality.

Theorem 1. Grant Assumptions 1 and 2. For any $x > 0$ we have

$$\sup_{\alpha \in A} \sqrt n (P - P_n)(\ell_\alpha - \ell_{\alpha_*}) \leq c \Big( \gamma_2(A, d_2)(1 + \sqrt x) + \gamma_1(A, d_\infty) \frac{1 + x}{\sqrt n} \Big) \tag{26}$$

with a probability larger than $1 - L e^{-x}$, where $c = c_{b, \|\alpha_0\|_\infty} = 4(b + \|\alpha_0\|_\infty) +
2([\sqrt 2 (\sqrt 2 + 1) \|\alpha_0\|_\infty]^{1/2} + 1)$ (and $L \approx 1.545433$).

The proof of Theorem 1 is given in Section 6 below. In (26), the function $\gamma_2$ is
related to the subgaussian term of the Bernstein inequality (18), while $\gamma_1$ is related to
the subexponential term. However, if we have an extra condition on the complexity
of $A$, it is possible to "remove" the $\gamma_1$ term from (26). This is called the adaptive
truncation argument, which is related to the use of brackets (instead of balls) to
construct a covering of $A$.

3.3 Brackets 

Entropy with bracketing was introduced by Dudley (1978). The adaptive truncation
argument was introduced by Bass (1985) for partial sum processes and Ossiander
(1987) for the empirical process. We refer to van de Geer (2000) (in particular the
proof of Theorem 8.13) and Massart (2007) (see the proof of Theorem 6.8)
for the use of this technique with statistical applications in mind. In the context of
generic chaining, bracketing can be defined as follows. Following Talagrand (2005) (see
in particular Theorem 2.7.10), we consider

$$\gamma_{[\,]}(A) := \inf \sup_{a \in A} \sum_{j \geq 0} 2^{j/2} \|\Delta_{A_j(a)}\|, \tag{27}$$

where the infimum is taken among all admissible sequences of $A$, where we recall that
$\|\cdot\|$ is defined by (10), and where

$$\Delta_A(z) := \sup_{a, a' \in A} |a(z) - a'(z)|$$

for any $z \in [0, 1]^{d+1}$. If $a^L, a^U \in A$, the bracket $[a^L, a^U]$ is the band

$$[a^L, a^U] := \{a \in A : a^L \leq a \leq a^U \text{ pointwise}\}.$$

The quantity $\|a^U - a^L\|$ is the diameter of the bracket. We denote by $N_{[\,]}(A, \varepsilon)$ the
minimal number of brackets with diameter not larger than $\varepsilon$ necessary to cover $A$.
Analogously to (25), one has

$$\gamma_{[\,]}(A) \leq L \int_0^{\infty} (\log N_{[\,]}(A, \varepsilon))^{1/2} \, d\varepsilon. \tag{28}$$

Entropy with bracketing is a refinement of $L^\infty$-entropy, that can be suitable for some
classes of functions, for instance functions with uniformly bounded variation, see for
instance van de Geer (1993) and Bitouzé et al. (1999). In our setting, it is useful to
"remove" the $\gamma_1$ term from (26), thanks to the following result, which is Talagrand's
version of the adaptive truncation argument.

Theorem 2 (Talagrand (2005), Theorem 2.7.11). Let $A$ be a countable set of measurable
functions, and let $u > 0$. If $\gamma_{[\,]}(A) \leq \Gamma$, we can find two sets $A_1, A_2$ with the
following properties:

• $\gamma_2(A_1, d_2) \leq L \Gamma$, $\gamma_1(A_1, d_\infty) \leq L u \Gamma$,

• $\gamma_2(A_2, d_2) \leq L \Gamma$, $\gamma_1(A_2, d_\infty) \leq L u \Gamma$,

• for any $a \in A_2$, we have $a \geq 0$ and $\|a\|_1 \leq L \Gamma / u$, and

• $A \subset A_1 + A_2'$, where $A_2' = \{a' : \exists a \in A_2, |a'| \leq a\}$.

Indeed, an immediate consequence of Proposition 1 and Theorem 2 (simply take
$u = \sqrt n$ in Theorem 2) is the following.

Corollary 1. Grant Assumptions 1 and 2. For any $x > 0$, we have

$$\sup_{\alpha \in A} \sqrt n (P - P_n)(\ell_\alpha - \ell_{\alpha_*}) \leq c \, \gamma_{[\,]}(A)(1 + \sqrt x \vee x)$$

with a probability larger than $1 - L e^{-x}$, where $c = c_{b, \|\alpha_0\|_\infty}$ is the same as in Theorem 1.

3.4 A risk bound for the ERM 

Corollary 1 is close to the concentration inequality (21) required in the peeling argument,
see Proposition 2 above. However, note that the peeling was done using sets
$A(\delta) = \{\alpha \in A : P(\ell_\alpha) - P(\ell_{\alpha_*}) \leq \delta\}$ for $\delta > 0$, while we can bound from above
the entropy (and consequently the functionals $\gamma$ and $\gamma_{[\,]}$) of balls $B(\delta) = \{\alpha \in A :
\|\alpha - \alpha_*\| \leq \delta\}$ using a standard result (see Section 3.5). Hence, it will be convenient
to work under the following assumption.

Assumption 3. Assume that $\|\alpha - \alpha_*\|^2 \leq P(\ell_\alpha) - P(\ell_{\alpha_*})$ for every $\alpha \in A$.

This assumption is a bit stronger than the standard margin assumption, see
Mammen and Tsybakov (1999); Tsybakov (2004), or the $\beta$-Bernstein condition, see
Bartlett and Mendelson (2006). [Note that here $\beta = 1$, as in most statistical models,
see Lecué (2007).] Indeed, let us prove that Assumption 3 entails, together with
Assumptions 1 and 2:

$$P((\ell_\alpha - \ell_{\alpha_*})^2) \leq c \, P(\ell_\alpha - \ell_{\alpha_*}) \quad \text{for every } \alpha \in A, \tag{29}$$

where $c = c_{b, \|\alpha_0\|_\infty} := 8((b + \|\alpha_0\|_\infty)^2 + \|\alpha_0\|_\infty)$, which is the $(1, c)$-Bernstein condition
from Bartlett and Mendelson (2006). We have using (2):

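$$\ell_\alpha(X, (Y_t), (N_t)) = \ell'_\alpha(X, (Y_t)) - 2 \int_0^1 \alpha(t, X) \, dM(t),$$

where $\ell'_\alpha$ is the loss function

$$\ell'_\alpha(x, (y_t)) := \int_0^1 \alpha(t, x)^2 y(t) \, dt - 2 \int_0^1 \alpha(t, x) \alpha_0(t, x) y(t) \, dt,$$

so the following decomposition holds:

$$\begin{aligned} \ell_\alpha(X, (Y_t), (N_t)) - \ell_{\alpha_*}(X, (Y_t), (N_t)) &= \ell'_\alpha(X, (Y_t)) - \ell'_{\alpha_*}(X, (Y_t)) + 2 \int_0^1 (\alpha_*(t, X) - \alpha(t, X)) \, dM(t) \\ &= \int_0^1 (\alpha(t, X) - \alpha_*(t, X))(\alpha(t, X) + \alpha_*(t, X) - 2 \alpha_0(t, X)) Y(t) \, dt \\ &\quad + 2 \int_0^1 (\alpha_*(t, X) - \alpha(t, X)) \, dM(t). \end{aligned}$$

Hence, using Assumptions 1 and 2, we have:

$$\begin{aligned} P((\ell_\alpha - \ell_{\alpha_*})^2) &\leq 8 (b + \|\alpha_0\|_\infty)^2 \|\alpha - \alpha_*\|^2 + 8 E\Big[ \int_0^1 (\alpha_*(t, X) - \alpha(t, X))^2 \alpha_0(t, X) Y(t) \, dt \Big] \\ &\leq 8 ((b + \|\alpha_0\|_\infty)^2 + \|\alpha_0\|_\infty) \|\alpha - \alpha_*\|^2, \end{aligned}$$

and (29) follows using Assumption 3. Now, let us show that Assumption 3 is mild:
it is met when $A$ is convex, for instance. The fact that convexity entails the margin
assumption is true in most statistical models, such as in regression, see for instance
Lee et al. (1998).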

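Lemma 3. Grant Assumption 1 and let $A$ be a convex class of functions bounded by
$b > 0$. Then, Assumption 3 is met.

Proof. Since $A$ is convex and $P(\ell_{\alpha_*}) = \inf_{\alpha \in A} P(\ell_\alpha)$, we have $\langle \alpha_* - \alpha_0, \alpha_* - \alpha \rangle \leq 0$ for
any $\alpha \in A$, where we recall that the inner product is given by (10). This entails

$$\begin{aligned} P(\ell_\alpha - \ell_{\alpha_*}) &= \|\alpha\|^2 - 2 \langle \alpha, \alpha_0 \rangle - \|\alpha_*\|^2 + 2 \langle \alpha_*, \alpha_0 \rangle \\ &= \|\alpha - \alpha_*\|^2 - 2 \langle \alpha_* - \alpha_0, \alpha_* - \alpha \rangle \\ &\geq \|\alpha - \alpha_*\|^2. \quad \square \end{aligned}$$

We are now in position to state the following risk bound for the ERM, under a
condition on the complexity of $A$.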



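Theorem 3. Grant Assumptions 1, 2 and 3. Assume that there is $\delta_{\min} > 0$ and a
continuous and increasing function $\varphi : \mathbb R_+ \to \mathbb R_+$ such that for any $\delta > \delta_{\min}$, any
$\alpha' \in A$ and any $\sqrt \delta$-ball $B(\sqrt \delta) = \{\alpha \in A : \|\alpha - \alpha'\| \leq \sqrt \delta\}$, we have either:

$$\varphi(\delta) \geq \gamma_2(B(\sqrt \delta), d_2) + \frac{1}{\sqrt n} \gamma_1(B(\sqrt \delta), d_\infty) \quad \text{for any } \delta > \delta_{\min},$$

or:

$$\varphi(\delta) \geq \gamma_{[\,]}(B(\sqrt \delta)) \quad \text{for any } \delta > \delta_{\min}.$$

Assume further that $\varphi^{-1}$ is strictly convex and that $\delta \mapsto \varphi(\delta)/\delta$ is decreasing. Then,
if $\bar\alpha_n$ is a $\rho$-ERM according to Definition 1, we have for any $\varepsilon > 0$, $x > 0$:

$$P(\ell_{\bar\alpha_n}) \leq P(\ell_{\alpha_*}) + (1 + \varepsilon) \rho + \varepsilon \, \delta_{n,\varepsilon}(x)$$

with a probability larger than $1 - L e^{-x}$, where

$$\delta_{n,\varepsilon}(x) := \varphi^{-1*}\Big( \frac{2 c (1 + \varepsilon) q (1 + \sqrt x \vee x)}{\varepsilon \sqrt n} \Big) \vee \delta_{\min},$$

with $q > 1$ the constant of the peeling argument, and $c = c_{b, \|\alpha_0\|_\infty}$ is the same as in Theorem 1.

Proof. Because of Assumption 3, we have $A(\delta) \subset B(\sqrt \delta)$, so Inequality (21) is satisfied
under the assumptions of the theorem with $\psi(\delta) = c \varphi(\delta)$, using Theorem 1 or
Corollary 1. Hence, we can apply Proposition 2, which entails the Theorem since

$$\psi^{-1*}(x) = \varphi^{-1*}(c x). \quad \square$$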

Remark 2 (Comparison). This bound for the ERM is of the same nature as previous
bounds for the ERM in more "standard" models, such as density, regression or
classification. The rate given in Theorem 3 gives, on examples, the same rate (up
to constants) as the one given in Massart (2007) (see Theorem 8.3), for instance.
Consider the situation where $\varphi(\delta) = c \delta^\alpha$ for $c > 0$ and $\alpha \in (0, 1)$ ($\varphi(\delta)$ is of order
$\sqrt{D \delta}$ when $A$ has a finite dimension, see Section 3.5 below). In this case, we have
$\varphi^{-1*}(x) = (1 - \alpha) \alpha^{\alpha/(1-\alpha)} (c x)^{1/(1-\alpha)}$, so $\delta_{n,\varepsilon}(x)$ is of order $(c/\sqrt n)^{1/(1-\alpha)}$. The
rate in the bound by Massart is solution to the equation $\sqrt n \varepsilon_*^2 = \varphi(\varepsilon_*^2)$, hence
$\varepsilon_*^2 = (c/\sqrt n)^{1/(1-\alpha)}$, and both rates have the same order.

Remark 3 (Talagrand's inequality). Usually, the complexity of the sieve $A$ is measured
by the functional $\varphi(B) = E^n[\sup_{\alpha \in B} (P - P_n)(\ell_\alpha - \ell_{\alpha_*})]$ where $B$ are balls in $A$,
like in Massart (2007), or spheres in $A$, see Bartlett and Mendelson (2006). The rate
is then the solution of a fixed point problem involving these functionals, such as,
roughly, the equation $\varphi(B(\varepsilon)) = \sqrt n \varepsilon^2$ from Massart (2007). Note that the main tool
in the proof of these results is Talagrand's deviation inequality, see Massart (2000),
Rio (2001) or Bousquet (2002). In Theorem 3, we were not able to state the bound
with a rate defined in such a way. Indeed, we needed a "stronger" control on the
complexity, given by the $\gamma$ functionals, to define $\delta_{n,\varepsilon}$. This is related to the fact that
we cannot use a Talagrand-type deviation inequality in the general model (2) for

$$\sup_{\alpha \in A} (P - P_n)(\ell_\alpha - \ell_{\alpha_*}) - E^n\Big[ \sup_{\alpha \in A} (P - P_n)(\ell_\alpha - \ell_{\alpha_*}) \Big].$$

Indeed, $\int_0^1 \alpha(t, X) \, dM(t)$ is not, in general, bounded (think of the Poisson process for
instance, which is a particular case of Section 1.2.2).

However, the story is different when N{t) is bounded, such as in the models of 
regression for right-censored data, and of transition intensities of Markov processes 
(see Sections 1.2.1 and 1.2.3). It is then possible to use the strength of Talagrand's 
inequality, following the arguments from Bartlett and Mendelson (2006) (up to sig- 
nificant modifications, since the analysis is conducted in the regression model). 

A case of importance (particularly in practice) is when $A$ is included in a linear
space $\bar A$ with a finite dimension $D$ (see Birgé and Massart (1998) and Massart (2007)
for instance). Using the version of Massart (2007) of a classical result concerning $L^\infty$-coverings
of a ball in such a space (see below), we can show that $\delta_{n,\epsilon}(x)$ is smaller
than a quantity of order $D/n$. This will be useful to compute rates of convergence in
Section 5 below.



3.5 When A is finite dimensional 

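Let us now consider the case where $A$ is a subset of some linear space $\bar A \subset L^2 \cap
L^\infty[0,1]^{d+1}$ with finite dimension $D$. Following Birgé and Massart (1998) and Barron et al.
(1999), we can consider the $L^\infty$-index

$$r(\bar A) := \frac{1}{\sqrt{D}} \inf_{\{\phi_\lambda\}} \sup_{\beta \neq 0} \frac{\big\|\sum_{\lambda \in \Lambda} \beta_\lambda \phi_\lambda\big\|_\infty}{|\beta|_\infty},$$

where $|\beta|_\infty = \max_{\lambda \in \Lambda} |\beta_\lambda|$ and where the infimum is taken over all orthonormal bases
$\{\phi_\lambda : \lambda \in \Lambda\}$ of $\bar A$.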

This index can be estimated for all the linear spaces usually chosen as approximation
spaces for adaptive estimation, see Birgé and Massart (1998) and Barron et al.
(1999). In particular, if $\bar A$ is spanned by a localized basis, then $r(\bar A)$ can be bounded
independently of $D$ (think of a wavelet basis for instance, more on that in Section 5
below).

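Using this index, we can derive a bound for $\gamma_{[\,]}(B(\sqrt{\delta}))$. For any $\varepsilon \in (0, \delta]$, the
following holds (see Massart (2007), Lemma 7.14):

$$N(B(\delta), \varepsilon, d_\infty) \le \Big(\frac{L\, r(\bar A)\, \delta}{\varepsilon}\Big)^{D}, \qquad (30)$$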



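where $L$ can be $\sqrt{3\pi e/2}$. But, using (28) together with the fact that $N_{[\,]}(A, \varepsilon/2) \le
N(A, \varepsilon, d_\infty)$, we obtain

$$\gamma_{[\,]}(B(\delta)) \le L \int_0^{\delta} \sqrt{D \ln\Big(\frac{L\, r(\bar A)\, \delta}{\varepsilon}\Big)}\, d\varepsilon \le L \delta \sqrt{D(\ln r(\bar A) + 1)}. \qquad (31)$$

So, we have the control required in Theorem 3: $\gamma_{[\,]}(B(\sqrt{\delta})) \le L\psi(\delta)$, with

$$\psi(\delta) = \sqrt{\delta D(\ln r(\bar A) + 1)},$$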
which is a function that satisfies the assumptions of Theorem 3. Note that 



$$\psi^{-1*}(x) = \frac{x^2 D(\ln r(\bar A) + 1)}{4}, \qquad (32)$$



so we have the following.

Corollary 2. Grant Assumptions 1 and 2, and assume that $A \subset \bar A$, where $\bar A$ is a linear
space with finite dimension $D$. Then, if $\bar\alpha_n$ is a $\rho$-ERM according to Definition 1,
we have for any $\epsilon > 0$ and $x > 0$:

PH. J < PiiaJ + il + e)p+ <^ + ^rHr{A) + l) D ^ ^ ^ 

e n 

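with a probability larger than $1 - Le^{-x}$, where $c = c_{b, \|\alpha_0\|_\infty}$. In particular, we obtain

$$\mathbb{E}^n\|\bar\alpha_n - \alpha_0\|^2 \le 2\rho + \inf_{\alpha \in A} \|\alpha - \alpha_0\|^2 + c\,(\ln r(\bar A) + 1)\frac{D}{n}. \qquad (33)$$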

Proof. Note that Assumption 3 is met since $\bar A$ is linear. So, Theorem 3 together
with (32) gives the first inequality. The second inequality follows by choosing $\epsilon = 1$,
by subtracting $P(\ell_{\alpha_0})$ from both sides of the inequality, and by integration with respect
to $x$. □

The next step is, usually, to have a control on the approximation or bias term
$\inf_{\alpha \in A} \|\alpha - \alpha_0\|^2$, and to choose a sieve with a dimension that balances the bias
term with the "variance" term $D/n$, hence the name bias-variance problem, see
Cucker and Smale (2002) for instance. Usually, this is done using the assumption
that $\alpha_0$ belongs to some smoothness class of functions, together with some results
from approximation theory. This is where the problem of adaptive estimation arises:
the choice of the optimal $D$ depends on the parameters of the smoothness class itself,
which are unknown in practice. So, one has to find a procedure able
to select automatically a sieve or a model among a collection $\{A_m : m \in \mathcal{M}\}$. This
is usually done using model selection, see the seminal paper Barron et al. (1999).
Model selection in the setup considered here has been studied in Comte et al. (2008).
In Section 4 below, we consider an alternative approach, based on a popular aggregation
procedure. It will allow the construction of smoothness and structure adaptive
estimators, see Section 5.
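
The bias-variance balance described above can be illustrated with a small numerical sketch. It assumes, purely for illustration, a bias term of order $D^{-2s}$ for a one-dimensional smoothness $s$ (this stylised form is an assumption and not a result of this section), and minimizes $D^{-2s} + D/n$ over integer dimensions. The function name is hypothetical.

```python
import numpy as np

def best_dimension(n, s, max_dim=None):
    """Return the integer D minimising the stylised bound D**(-2*s) + D/n.

    The bias proxy D**(-2*s) is an illustrative assumption; D/n is the
    "variance" term appearing in the text.
    """
    max_dim = n if max_dim is None else max_dim
    dims = np.arange(1, max_dim + 1, dtype=float)
    bound = dims ** (-2.0 * s) + dims / n
    return int(dims[np.argmin(bound)])

if __name__ == "__main__":
    for s in (0.5, 1.0, 2.0):
        # The minimiser grows like n**(1/(2s+1)), so it depends on the unknown s.
        print(s, best_dimension(10_000, s))
```

The minimiser changes with $s$, which is exactly why a data-driven selection or aggregation step is needed when the smoothness is unknown.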



4 Agnostic learning, aggregation 

Let $A = A(\Lambda) := \{\alpha_\lambda : \lambda \in \Lambda\}$ be a set of arbitrary functions, called a dictionary,
with cardinality $M$. For instance, this can be a set of so-called weak estimators,
computed based on a set of observations independent of the sample $D_n$. We consider
the problem of agnostic learning: without any assumption on $\alpha_0$, except for some
boundedness assumption, we want to construct (from the data) a procedure $\hat\alpha_n$ with
a risk as close as possible to the smallest risk over $A$. Namely, we want to obtain an
oracle inequality of the form

$$\mathbb{E}^n\|\hat\alpha_n - \alpha_0\|^2 \le c \min_{\alpha \in A} \|\alpha - \alpha_0\|^2 + \phi(n, M),$$

where $c \ge 1$ and $\phi(n, M)$ is called the residue or rate of aggregation, which is a
quantity that we want to be small as $n$ increases. An oracle inequality that holds
with $c = 1$ is called sharp.

This problem has been considered in several statistical models, mainly in regression,
density and classification, see among others Nemirovski (2000); Catoni (2001);
Juditsky et al. (2006); Leung and Barron (2006); Dalalyan and Tsybakov (2007); Yang
(2000); Audibert (2009). For instance, we know from Tsybakov (2003) that the optimal
rate of aggregation in the Gaussian regression model is $\phi(n, M) = (\log M)/n$ (in
the sharp oracle inequality context). This rate is achieved by the algorithms of aggregation
with cumulative exponential weights, see Juditsky et al. (2006); Audibert
(2009) and aggregation with exponential weights, see Dalalyan and Tsybakov (2007)
(when the error of estimation is measured by the empirical norm; a similar result for
the integrated norm is, as far as we know, still a conjecture).

Aggregation with exponential weights is a popular algorithm. It is of importance
in machine learning, for estimation, prediction using expert advice, in PAC-Bayesian
learning and other settings, see Cesa-Bianchi and Lugosi (2006), Audibert (2009) and
Catoni (2001), among others. However, there is no result for this algorithm in the
general model (2), nor for any of the particular cases given in Section 1.2. In this
Section, we construct this algorithm for model (2), and give in Theorem 4 below an
oracle inequality.

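The idea of aggregation is to mix the elements from $A(\Lambda)$: using the data, compute
weights $\theta(\alpha) \in [0, 1]$ for each $\alpha \in A(\Lambda)$ satisfying $\sum_{\lambda \in \Lambda} \theta(\alpha_\lambda) = 1$. These weights
give a level of significance to $\alpha$. The aggregate is the convex combination

$$\hat\alpha_n := \sum_{\lambda \in \Lambda} \theta(\alpha_\lambda)\, \alpha_\lambda, \qquad (34)$$

where the weight of $\alpha \in A(\Lambda)$ is given by

$$\theta(\alpha) := \frac{\exp\big(-n P_n(\ell_\alpha)/T\big)}{\sum_{\lambda \in \Lambda} \exp\big(-n P_n(\ell_{\alpha_\lambda})/T\big)}, \qquad (35)$$

where $T > 0$ is the so-called temperature parameter and where we recall that

$$P_n(\ell_\alpha) = \frac{1}{n} \sum_{i=1}^n \int_0^1 \alpha(t, X_i)^2\, Y^i(t)\, dt - \frac{2}{n} \sum_{i=1}^n \int_0^1 \alpha(t, X_i)\, dN^i(t)$$

is the empirical risk of $\alpha$. The shape of this mixing estimator is easily explained.
Indeed, the weighting scheme (35) is the only minimizer of

$$R_n(\theta) + \frac{T}{n} \sum_{\lambda \in \Lambda} \theta_\lambda \log \theta_\lambda \qquad (36)$$

among all $\theta \in \Theta$ (we use the convention $0 \log 0 = 0$), where

$$\Theta := \Big\{\theta \in \mathbb{R}^M : \theta_\lambda \ge 0, \ \sum_{\lambda \in \Lambda} \theta_\lambda = 1\Big\},$$

and where $R_n(\theta)$ is the linearized empirical risk

$$R_n(\theta) := \sum_{\lambda \in \Lambda} \theta_\lambda P_n(\ell_{\alpha_\lambda}).$$

Equation (36) is the linearized risk of $\theta \in \Theta$, which is penalized by a quantity proportional
to the Shannon entropy of $\theta$. The resulting aggregated estimator $\hat\alpha_n$ is
then something between the ERM among the elements of $A(\Lambda)$ (when $T$ is small),
and the mean of the elements of $A(\Lambda)$ (when $T$ is large).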
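
The weighting scheme can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes covariate-free candidate intensities $\alpha(t)$ on $[0,1]$, an at-risk indicator of the form $Y^i(t) = 1_{t \le \tau_i}$, and a midpoint Riemann sum for the time integral. All function names and the data layout are hypothetical.

```python
import numpy as np

def empirical_risk(alpha, jump_times, at_risk_until, grid):
    """Empirical risk (1/n) sum_i int alpha^2 Y^i dt - (2/n) sum_i int alpha dN^i.

    jump_times[i]    : jump times of the counting process N^i in [0, 1]
    at_risk_until[i] : Y^i(t) = 1 for t <= at_risk_until[i], else 0 (assumed form)
    grid             : equispaced midpoints used for the Riemann sum
    """
    dt = grid[1] - grid[0]
    total = 0.0
    for jumps, end in zip(jump_times, at_risk_until):
        at_risk = grid <= end
        total += np.sum(alpha(grid[at_risk]) ** 2) * dt
        # The integral against dN^i is the sum of alpha over the jump times.
        total -= 2.0 * np.sum(alpha(np.asarray(jumps, dtype=float)))
    return total / len(jump_times)

def gibbs_weights(risks, n, temperature):
    """Exponential weights proportional to exp(-n * risk / T), computed stably."""
    scores = -n * np.asarray(risks, dtype=float) / temperature
    scores -= scores.max()  # avoids overflow without changing the ratios
    w = np.exp(scores)
    return w / w.sum()

def aggregate(candidates, weights):
    """Convex combination of the candidate functions."""
    return lambda t: sum(w * a(t) for w, a in zip(weights, candidates))

if __name__ == "__main__":
    grid = (np.arange(1000) + 0.5) / 1000
    jumps, ends = [[0.2, 0.7], [0.5]], [1.0, 1.0]
    cands = [lambda t, c=c: np.full_like(t, c, dtype=float) for c in (0.5, 1.5, 3.0)]
    risks = [empirical_risk(a, jumps, ends, grid) for a in cands]
    for T in (0.01, 1.0, 1e6):
        print(T, gibbs_weights(risks, n=2, temperature=T))
```

A small temperature concentrates the weights on the empirical risk minimizer of the dictionary, while a very large one returns nearly uniform weights, in line with the interpretation given above.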
Theorem 4. Assume that $\|\alpha_0\|_\infty < +\infty$, and that there is $b > 0$ such that $\|\alpha\|_\infty \le b$
for any $\alpha \in A(\Lambda)$. Then, for any $\epsilon > 0$, the mixing estimator $\hat\alpha_n$ defined by (34)
satisfies

$$\mathbb{E}^n\|\hat\alpha_n - \alpha_0\|^2 \le (1 + \epsilon) \inf_{\lambda \in \Lambda} \|\alpha_\lambda - \alpha_0\|^2 + c\,\frac{\log M}{n}$$

for any $n \ge 1$, where $c = c_{b, \|\alpha_0\|_\infty, T, \epsilon}$.

Theorem 4 is a model-selection type oracle inequality for the aggregation procedure
given by (34). The residual term in the oracle inequality is of order $(\log M)/n$,
which is the correct rate of convex aggregation, see Tsybakov (2003) (in the Gaussian
regression setup, and for other models with margin parameter equal to 1, see Lecué
(2007)).

Remark 4. The main criticism one can make about Theorem 4 is that it is not sharp:
the leading constant is $1 + \epsilon$ instead of 1 in front of $\inf_{\lambda \in \Lambda} \|\alpha_\lambda - \alpha_0\|^2$, and the constant
$c$ in front of the residue is far from being optimal. The consequence is that we are not
able in this setting to give a theoretically optimal value for $T$. Sharp oracle inequalities
are available for aggregation with exponential weights or cumulative weights, see
Dalalyan and Tsybakov (2007), Juditsky et al. (2006) and Audibert (2009), see also
references mentioned above. However, in the setup considered here, the proof of a
sharp oracle inequality seems quite challenging, and will be the subject of further
investigations.



5 Structure and smoothness adaptive estimation 

In this Section, we propose an application of the results obtained in Sections 3 and 4.
We construct an estimator that adapts to the smoothness of $\alpha_0$ in a purely nonparametric
setting, see Section 5.1, and to its structure in a single-index setup, see
Section 5.2. The steps of the construction of the estimator are given in Definition 2
below. As usual with algorithms coming from statistical learning theory, we need
to split the sample (a very particular exception can be found in Leung and Barron
(2006)). To simplify, we shall assume that the sample size is $2n$, see (3), so $D_{2n}$ is
the full sample.

Definition 2. The steps for the computation of an aggregated estimator $\hat\alpha_n$ are the
following:

1. split the whole sample $D_{2n}$ (see (3)) into a training sample $D_{n,1}$ of size $n$ and
a learning sample $D_{n,2}$ of size $n$;

2. choose a collection of sieves $\{A_m : m \in M_n\}$ and compute, using the training sample $D_{n,1}$, the
corresponding empirical risk minimizers $\{\bar\alpha_m : m \in M_n\}$ (see Definition 1);

3. using the learning sample $D_{n,2}$, compute the aggregated estimator $\hat\alpha_n$ based on
the dictionary $\{\bar\alpha_m : m \in M_n\}$, see (34) and (35).

Examples of collections $\{A_m : m \in M_n\}$ are given in Appendix A.1, together with
the necessary control of the $L^\infty$-index (see Section 3.5), and a useful approximation
result.

Remark 5 (Jackknife). The behavior of the aggregate $\hat\alpha_n$ typically depends on the
split selected in Step 1, in particular when the number of observations is small. Hence,
a good strategy is to jackknife: repeat, say, $J$ times Steps 1-3 to obtain aggregates
$\{\hat\alpha_n^{(1)}, \ldots, \hat\alpha_n^{(J)}\}$, and compute the mean:

$$\hat\alpha_n^{J} := \frac{1}{J} \sum_{j=1}^{J} \hat\alpha_n^{(j)}.$$

This jackknifed estimator should provide more stable results than a single aggregate.
Moreover, by convexity of the risk $\alpha \mapsto P(\ell_\alpha)$, the jackknifed estimator satisfies the
same risk bounds as a single aggregate.
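
The jackknife strategy of Remark 5 can be sketched as follows. This is an illustrative outline under stated assumptions: the routine `fit_aggregate` is a hypothetical, user-supplied function standing for Steps 2 and 3 (it is not defined in the paper), and the aggregate is represented by its values on a fixed evaluation grid so that averaging is a plain array mean.

```python
import numpy as np

def jackknife_aggregate(sample, fit_aggregate, n_repeats, rng):
    """Average the aggregates obtained from n_repeats random half/half splits.

    sample        : array of 2n observations (one row or entry per individual)
    fit_aggregate : hypothetical callable (train, learn) -> np.ndarray holding
                    the aggregate evaluated on a fixed grid (Steps 2 and 3)
    rng           : np.random.Generator used to draw the splits (Step 1)
    """
    sample = np.asarray(sample)
    half = len(sample) // 2
    fits = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(sample))
        train, learn = sample[perm[:half]], sample[perm[half:]]
        fits.append(fit_aggregate(train, learn))
    # Mean over the J repetitions, as in the displayed formula of Remark 5.
    return np.mean(fits, axis=0)

if __name__ == "__main__":
    toy_fit = lambda train, learn: np.array([train.mean(), learn.mean()])
    rng = np.random.default_rng(0)
    print(jackknife_aggregate(np.arange(10.0), toy_fit, n_repeats=5, rng=rng))
```

In the toy usage, `toy_fit` only returns sample means; it is a stand-in chosen to keep the sketch runnable, not an estimator of the intensity.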

5.1 Adaptive estimation in the purely nonparametric setting 

In model (2), the behaviour of $\alpha_0(t, x)$ with respect to time $t$ and with respect to the
covariates $x$ have no statistical reason to be linked. So, in a purely nonparametric
setting, it is mandatory to consider anisotropic smoothness for $\alpha_0$. We shall assume
in the statement of the upper bound, see Theorem 5 below, that $\alpha_0 \in B^s_{2,\infty}$, where
$B^s_{2,\infty}$ is an anisotropic Besov space (see Appendix A.1) and $s = (s_1, \ldots, s_{d+1})$
is a vector of smoothness, where $s_i$ is the smoothness in the $i$th coordinate. For the
construction of the adaptive estimator, see Step 2 above, we need a collection of sieves
$\{A_m : m \in M_n\}$.

Definition 3 (Collection). We take $\{A'_m : m \in M_n\}$ as:

• a collection of linear spaces spanned by piecewise polynomials (see Section A.1.1),
with degrees not larger than $l_i$ in the $i$th coordinate, or

• a collection of linear spaces spanned by wavelets (see Section A.1.2) with $l_i$
vanishing moments in the $i$th coordinate.

In both cases, we say that $(l_1, \ldots, l_{d+1})$ is the smoothness of the collection, and we
take

$$M_n := \big\{(m_1, \ldots, m_{d+1}) \in \mathbb{N}^{d+1} : 2^{m_i} \le n^{1/(d+1)} \text{ for } i = 1, \ldots, d+1\big\}.$$

Finally, we fix a constant $b > 0$ and take $A_m := \{\alpha \in A'_m : \|\alpha\|_\infty \le b\}$ for every
$m \in M_n$.

For the statement of the adaptive upper bound, we need the following assumption,
which is a stronger version of the previous Assumption 1.

Assumption 4. Assume that $P_X$ has a density $f_X$ with respect to the Lebesgue
measure, which is bounded and with support $[0,1]^{d}$. Moreover, we assume that
$\|\alpha_0\|_\infty \le b$, where $b$ is known (it is used in the definition of the sieves, see Definition 3).

Now, we can use together Corollary 2 (see Section 3.5), Theorem 4 (see Section 4)
and Lemma 5 (see Appendix A.1) to derive an adaptive upper bound. Take $\rho = 1/n$
in Corollary 2 and, say, $\epsilon = 1$ in Theorem 4, to obtain

$$\mathbb{E}^{2n}\|\hat\alpha_n - \alpha_0\|^2 \le c \inf_{m \in M_n} \Big( \inf_{\alpha \in A_m} \|\alpha - \alpha_0\|^2 + \frac{D_m}{n} \Big) + c\,\frac{\log |M_n|}{n},$$

where for $m = (m_1, \ldots, m_{d+1}) \in M_n$,

$$D_m := \prod_{i=1}^{d+1} D_{m_i} = \prod_{i=1}^{d+1} 2^{m_i}.$$

Let us assume that $\alpha_0 \in B^s_{2,\infty}$, where $s = (s_1, \ldots, s_{d+1})$ satisfies $s_i > (d+1)/2$
for each $i = 1, \ldots, d+1$. Assumption 4 entails $\|\alpha\|^2 \le \|f_X\|_\infty \|\alpha\|_2^2$ for any $\alpha \in
L^2[0,1]^{d+1}$, where $\|\alpha\|_2^2 = \int_{[0,1]^{d+1}} \alpha(t, x)^2\, dt\, dx$. So, using Lemma 5, we have, for any $m \in M_n$,

$$\mathbb{E}^{2n}\|\hat\alpha_n - \alpha_0\|^2 \le c \Big( \sum_{j=1}^{d+1} D_{m_j}^{-2 s_j} + \frac{\prod_{j=1}^{d+1} D_{m_j}}{n} + \frac{\log |M_n|}{n} \Big),$$

where $c = c_{b, T, s, d, \|f_X\|_\infty, |\alpha_0|_{B^s_{2,\infty}}}$. Note that $(\log |M_n|)/n \le c_d (\log n)/n$, so the rate
of convergence is given by the optimal tradeoff between the bias and the variance
terms. Since $s_i > (d+1)/2$ for any $i = 1, \ldots, d+1$, we have $n^{\bar s/(s_i(2\bar s + d + 1))} \le n^{1/(d+1)}$,
so we can choose $m = (m_1, \ldots, m_{d+1}) \in M_n$ such that

$$2^{m_i - 1} \le n^{\bar s/(s_i(2\bar s + d + 1))} \le 2^{m_i} \quad \text{for } i = 1, \ldots, d+1. \qquad (37)$$

This gives

$$\sum_{j=1}^{d+1} D_{m_j}^{-2 s_j} + \frac{\prod_{j=1}^{d+1} D_{m_j}}{n} \le c\, n^{-2\bar s/(2\bar s + d + 1)},$$

so we proved the following theorem.

Theorem 5. Grant Assumption 4, and consider a collection $\{A_m : m \in M_n\}$ given
by Definition 3 with smoothness $(l_1, \ldots, l_{d+1})$. Assume that $\alpha_0 \in B^s_{2,\infty}$ where $s =
(s_1, \ldots, s_{d+1})$ satisfies $(d+1)/2 < s_i < l_i$ for each $i = 1, \ldots, d+1$. Then, if $\hat\alpha_n$ is
the aggregated estimator given by Steps 1-3, we have

$$\mathbb{E}^{2n}\|\hat\alpha_n - \alpha_0\|^2 \le c\, n^{-2\bar s/(2\bar s + d + 1)},$$

where $c = c_{b, T, s, d, \|f_X\|_\infty, |\alpha_0|_{B^s_{2,\infty}}}$.

The rate $n^{-2\bar s/(2\bar s + d + 1)}$ is the optimal rate of convergence (in the minimax sense)
in this model, under the extra assumption that $f_X$ is bounded away from zero on
$[0, 1]^d$, see Theorem 3 in Comte et al. (2008). Hence, Theorem 5 shows that $\hat\alpha_n$ adapts
to the smoothness of $\alpha_0$ over a range of Besov spaces $B^s_{2,\infty}$ for $(d+1)/2 < s_i < l_i$.

5.2 Dimension reduction, single-index 

The mark $X$ is $d$-dimensional so the intensity $\alpha_0$ takes $d + 1$ variables. As with any
other nonparametric estimation model, we know that when $d$ gets large the dimension
has a significant impact on the accuracy of estimation. This is the so-called curse of
dimensionality phenomenon, which is reflected by the rate $n^{-2\bar s/(2\bar s + d + 1)}$ in Theorem 5
above. This rate is slow if $d$ is large compared to $\bar s$. In this Section, we propose
a way to "get back" the rate $n^{-2\bar s/(2\bar s + 2)}$, using single-index modelling. Thanks to
our approach based on aggregation, we are able to construct an estimator that automatically
takes advantage (without any prior testing) of the single-index structure
when possible: the rate is then $n^{-2\bar s/(2\bar s + 2)}$, otherwise it is the purely nonparametric
rate $n^{-2\bar s/(2\bar s + d + 1)}$. This idea of mixing nonparametric and semiparametric
estimators can also be found in Yang (2000) for density estimation.

Dimension reduction techniques usually involve an assumption on the structure
of the object to be estimated. Main examples are the additive and the single-index
models. Additive modelling was proposed by Linton et al. (2003) in the same context
as the one considered here, with very different techniques (kernel estimation and backfitting).
In this paper, we focus on single-index modelling (see Remark 6 below). On
single-index models (mainly in regression) and the corresponding estimation problems
(estimation of the link function, estimation of the index), see Hristache et al. (2001),
Delecroix et al. (2003), Xia and Härdle (2006), Delecroix et al. (2006), Geenens and Delecroix
(2005), Gaïffas and Lecué (2007), Dalalyan et al. (2008) among many others. The
single-index structure is as follows: assume that there is an unknown function $\beta_0 :
\mathbb{R}_+ \times \mathbb{R} \to \mathbb{R}_+$ (called link function, which has unknown smoothness here) and an
unknown vector $v_0 \in \mathbb{R}^d$ (called index) such that

$$\alpha_0(t, x) = \beta_0(t, v_0^\top x). \qquad (38)$$

In order to make the representation (38) unique (identifiability), we shall assume (see
Assumption 5 below) that $v_0 \in S_+^{d-1}$, where $S_+^{d-1}$ is the half-unit sphere defined by

$$S_+^{d-1} = \{v \in \mathbb{R}^d : |v|_2 = 1 \text{ and } v_d \ge 0\}, \qquad (39)$$

where $|\cdot|_2$ is the Euclidean norm over $\mathbb{R}^d$.

The steps of the construction of the adaptive estimator in this context follow the
ones from Definition 2, but the dictionary is enlarged by a set $\{\bar\alpha_m^{(v)} : m \in M_n^{(1)}, v \in
S_\Delta^{d-1}\}$ of empirical risk minimizers, where $S_\Delta^{d-1}$ is a $\Delta$-net of $S_+^{d-1}$. So, compared to
Section 5.1, the idea is simply to add estimators that work under the single-index
assumption to the dictionary.

Definition 4. The steps for the computation of the aggregated estimator $\hat\alpha_n$ are the
following:

1. split the whole sample $D_{2n}$ (see (3)) into a training sample $D_{n,1}$ of size $n$ and
a learning sample $D_{n,2}$ of size $n$;

2. compute a $\Delta = (n \log n)^{-1/2}$-net of the half-unit sphere $S_+^{d-1}$, denoted by $S_\Delta^{d-1}$,
and for each $v \in S_\Delta^{d-1}$ compute the pseudo-training samples

$$D_{n,1}(v) := \big[(v^\top X_i, N^i(t), Y^i(t)) : t \in [0, 1],\ 1 \le i \le n\big], \qquad (40)$$

where the $d$-dimensional marks $X_i$ are simply replaced by the univariate marks $v^\top X_i$;

3. fix a collection of 2-dimensional sieves ($d = 1$) $\{A_m^{(1)} : m \in M_n^{(1)}\}$ given
by Definition 3. Compute, for every $m \in M_n^{(1)}$ and $v \in S_\Delta^{d-1}$, empirical risk
minimizers $\bar\beta_m^{(v)}$, over $A_m^{(1)}$, of the empirical risks

$$P_n^{(v)}(\ell_\alpha) := \frac{1}{n} \sum_{i=1}^n \int_0^1 \alpha(t, v^\top X_i)^2\, Y^i(t)\, dt - \frac{2}{n} \sum_{i=1}^n \int_0^1 \alpha(t, v^\top X_i)\, dN^i(t),$$

and define

$$\bar\alpha_m^{(v)}(t, x) := \bar\beta_m^{(v)}(t, v^\top x)$$

(so that each $\bar\alpha_m^{(v)}$ works as if $v$ were the true index);

4. follow Steps 2 and 3 from Definition 2, where we add the estimators $\{\bar\alpha_m^{(v)} : m \in
M_n^{(1)}, v \in S_\Delta^{d-1}\}$ to the set of purely nonparametric estimators $\{\bar\alpha_m : m \in M_n\}$
in the aggregation step.
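
Step 2 above can be sketched concretely in the simplest case $d = 2$, where the half-unit sphere is a half circle and a net can be built from an angular grid. This is an illustrative construction under that assumption (the paper does not prescribe how the net is built), and the function names are hypothetical.

```python
import numpy as np

def half_circle_net(n):
    """Return (delta, net) where net is a delta-net of {v in R^2 : |v|_2 = 1, v_2 >= 0}.

    delta = (n log n)^(-1/2). Angles are spaced by at most 2 * delta, so every
    angle in [0, pi] is within delta of a grid angle, and the Euclidean (chord)
    distance between two unit vectors never exceeds their angular gap.
    """
    delta = 1.0 / np.sqrt(n * np.log(n))
    n_points = int(np.ceil(np.pi / (2.0 * delta))) + 1
    angles = np.linspace(0.0, np.pi, n_points)
    return delta, np.column_stack([np.cos(angles), np.sin(angles)])

def pseudo_covariates(X, v):
    """Replace the d-dimensional marks X_i by the univariate marks v^T X_i."""
    return np.asarray(X, dtype=float) @ np.asarray(v, dtype=float)

if __name__ == "__main__":
    delta, net = half_circle_net(50)
    print(delta, len(net))
    X = np.array([[0.2, 0.9], [0.5, 0.1]])
    print(pseudo_covariates(X, net[3]))
```

The number of net points grows like $\Delta^{-(d-1)}$, which is why the approach is limited to a reasonably small $d$.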

An important point of this algorithm is that we do not estimate the index directly:
we mix estimators in order to adapt to the unknown $v_0$ and to the unknown smoothness
of $\beta_0$. This approach was previously adopted in Gaïffas and Lecué (2007) for
the estimation of the regression function. Note that the size of $S_\Delta^{d-1}$ increases strongly
with $n$ and $d$, so this method is restricted to a reasonably small $d$. High dimensional
covariates cannot be handled in such a semiparametric approach; this problem will
be the subject of another work. About high dimension, see Tibshirani (1997), where
the LASSO has been studied in the Cox model.

The following set of assumptions gives the identifiability of model (38) (see for
instance the survey paper by Geenens and Delecroix (2005), or Chapter 2 in Horowitz
(1998)), except for (41) and (42), which are technical assumptions.

Assumption 5. Assume that (38) holds, and that

• $x \mapsto \beta_0(t, x)$ is not constant over the support of $v_0^\top X$;

• $X$ admits at least one continuously distributed coordinate (w.r.t. the Lebesgue
measure);

• the support of $X$ is not contained in any linear subspace of $\mathbb{R}^d$;

• there is $c_0 > 0$ such that for any $x, y \in [0, 1]^d$, any $t \ge 0$:

$$|\beta_0(t, x) - \beta_0(t, y)| \le c_0 |x - y|; \qquad (41)$$

• there is $b_0 > 0$ such that

$$\inf_{(t, x) \in [0, 1]^{d+1}} \beta_0(t, x) \ge b_0. \qquad (42)$$



Remark 6. In the problem of estimating the intensity of a counting process in presence
of covariates, two of the most popular models are special cases of the single-index
model, as described in Equation (38):

• the Cox model (see Cox (1972)), where there exists an unknown function $\beta_0$
such that:

$$\alpha_0(t, x) = \beta_0(t) \exp(v_0^\top x), \qquad (43)$$

and

• the Aalen model (see Aalen (1980)), which can be written as:

$$\alpha_0(t, x) = \beta_0(t) + v_0^\top x. \qquad (44)$$

This emphasizes the relevance of considering single-index models in this context, and
the use of anisotropic smoothness. This paper is only a first step in this direction, for
the expected rate of convergence in these two models would be $n^{-2s/(2s+1)}$ when the
link function has smoothness $s$ in some sense. Adaptive estimation by aggregation,
including the Cox and Aalen models, will be addressed in a forthcoming paper.

Theorem 6. Grant the same assumptions as in Theorem 5 and let $\hat\alpha_n$ be the aggregated
estimator from Definition 4.

• If Assumption 5 holds (single-index) with $\beta_0 \in B^s_{2,\infty}$ where $s = (s_1, s_2)$ satisfies
$1 < s_i < l_i$ for $i = 1, 2$, we have

$$\mathbb{E}^{2n}\|\hat\alpha_n - \alpha_0\|^2 \le c\, n^{-2\bar s/(2\bar s + 2)}$$

for $n$ large enough, where $c = c_{b, T, s, d, \|f_X\|_\infty, |\beta_0|_{B^s_{2,\infty}}, v_0, b_0, c_0}$.

• Otherwise, we have, when $\alpha_0 \in B^s_{2,\infty}$, where $s = (s_1, \ldots, s_{d+1})$ satisfies
$(d+1)/2 < s_i < l_i$ for each $i = 1, \ldots, d+1$, that

$$\mathbb{E}^{2n}\|\hat\alpha_n - \alpha_0\|^2 \le c\, n^{-2\bar s/(2\bar s + d + 1)}$$

for $n$ large enough, where $c = c_{b, T, s, d, \|f_X\|_\infty, |\alpha_0|_{B^s_{2,\infty}}}$.

The proof of this theorem is given in Section 6. This theorem proves that $\hat\alpha_n$
adapts to the smoothness of the intensity, and to its structure: if the single-index
model (38) holds, then the rate is $n^{-2\bar s/(2\bar s + 2)}$, which is the optimal rate when $X$
is one-dimensional. Otherwise, the rate of convergence is $n^{-2\bar s/(2\bar s + d + 1)}$ when the
covariate is $d$-dimensional. Of course, this result is not surprising, since any kind of
estimator can be used in the dictionary to be aggregated. However, note that the
proof of Theorem 6 involves a technical tool concerning counting processes, namely
a concentration inequality for the likelihood ratio between two indexes in $S_+^{d-1}$, see
Lemma 4 in Section 6.

6 Proofs 

Proof of Proposition 1 

Proof of Proposition 1. Let us define the process

$$Z_n(\alpha, t) := \frac{1}{\sqrt{n}} \sum_{i=1}^n \int_0^t \alpha(u, X_i)\, dM^i(u) := \sum_{i=1}^n Z_n^i(\alpha, t),$$

so that $Z_n(\alpha) = Z_n(\alpha, 1)$. The predictable variation of $M^i$ is given by $\langle M^i(t) \rangle =
\int_0^t \alpha_0(u, X_i) Y^i(u)\, du$, so we have

$$\langle Z_n^i(\alpha, t) \rangle = \frac{1}{n} \int_0^t \alpha(u, X_i)^2 \alpha_0(u, X_i) Y^i(u)\, du$$

for any $t \in [0, 1]$. Moreover, we have $\Delta M^i(t) \in \{0, 1\}$ for any $i = 1, \ldots, n$ since the
counting processes $N^i$ have an intensity. We can write $Z_n^i = Z_n^{i,c} + Z_n^{i,d}$ where $Z_n^{i,c}$
is a continuous martingale and where $Z_n^{i,d}$ is a purely discrete martingale (see e.g.
Liptser and Shiryayev (1989)). Let $h > 0$ be fixed and define $U_h^i(t) := h Z_n^i(\alpha, t) -
S_h^i(t)$, where $S_h^i(t)$ is the compensator of

$$\frac{1}{2} h^2 \langle Z_n^{i,c}(\alpha, t) \rangle + \sum_{s \le t} \big( \exp(h |\Delta Z_n^i(\alpha, s)|) - 1 - h |\Delta Z_n^i(\alpha, s)| \big). \qquad (45)$$

We know from the proof of Lemma 2.2 and Corollary 2.3 of van de Geer (1995), see
also Liptser and Shiryayev (1989), that $\exp(U_h^i(t))$ is a super-martingale. Then, if
$S_h := \sum_{i=1}^n S_h^i$ and $U_h := \sum_{i=1}^n U_h^i$, we have

< (E"[e25^(i)l<^„(„)><,.])^/^ (46) 

The last inequality holds since $\exp(U_h^i(t)) = \exp(h Z_n^i(\alpha, t) - S_h^i(t))$ are independent
super-martingales with $U_h^i(0) = 0$, so that $\mathbb{E}[\exp(2 U_h^i(t))] \le 1$, for $i = 1, \ldots, n$. Let
us decompose $M^i = M^{i,c} + M^{i,d}$, with $M^{i,c}$ a continuous martingale and $M^{i,d}$ a
purely discrete martingale. The process $V_2^i(t) := \langle M^i(t) \rangle$ is the compensator of the
quadratic variation process $[M^i(t)] = \langle M^{i,c}(t) \rangle + \sum_{s \le t} |\Delta M^i(s)|^2$. If $k \ge 3$, we define
$V_k^i(t)$ as the compensator of the $k$-variation process $\sum_{s \le t} |\Delta M^i(s)|^k$ of $M^i(t)$. Since
$\Delta M^i(t) \in \{0, 1\}$ for all $0 \le t \le 1$, the $V_k^i$ are all equal for $k \ge 3$ and such that
$V_k^i(t) \le V_2^i(t)$, for all $k \ge 3$. The process $S_h^i(t)$ has been defined as the compensator
of (45). As a consequence, we have:

k>2 ■ ^ 

<{a(..Xd=.VJ(.)x|:M£!(-|)' 

and if (Z„(a)) < (5^ 

Thus, plugging this in (46) gives 

^nAh) < exp ^ 

for any /i > 0. Now, choosing 

\\a\\oo V 0^\/r 

entails (16). □ 
1

2„ 



Proof of Theorem 1 

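Proof of Theorem 1. First, note that (2) entails

$$\ell_\alpha(X, (Y_t), (N_t)) = \ell'_\alpha(X, (Y_t)) - 2 \int_0^1 \alpha(t, X)\, dM(t),$$

where $\ell'_\alpha$ is the loss function

$$\ell'_\alpha(x, (y_t)) := \int_0^1 \alpha(t, x)^2 y(t)\, dt - 2 \int_0^1 \alpha(t, x) \alpha_0(t, x) y(t)\, dt.$$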

So, the following decomposition holds: 

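where we recall that $Z_n(\cdot)$ is given by (14), and where $P(\ell'_\alpha) := \mathbb{E}[\ell'_\alpha(X, (Y_t))]$ and
$P_n(\ell'_\alpha) := \frac{1}{n} \sum_{i=1}^n \ell'_\alpha(X_i, (Y_t^i))$. First, let us prove the concentration inequality for
$\sup_{\alpha \in A}(Z_n(\alpha_*) - Z_n(\alpha))$. The proof follows the lines of the proof of Theorem 1.2.7
in Talagrand (2005). Consider admissible sequences $(\mathcal{B}_j)_{j \ge 0}$ and $(\mathcal{C}_j)_{j \ge 0}$ such that

$$\sum_{j \ge 0} 2^j \Delta(B_j(\alpha), d_\infty) \le 2\gamma_1(A, d_\infty) \quad \text{and} \quad \sum_{j \ge 0} 2^{j/2} \Delta(C_j(\alpha), d_2) \le 2\gamma_2(A, d_2)$$

for any $\alpha \in A$. We construct partitions $\mathcal{A}_j$ of $A$ as follows. Set $\mathcal{A}_0 = \{A\}$ and for
$j \ge 1$, $\mathcal{A}_j$ is the partition generated by $\mathcal{B}_{j-1}$ and $\mathcal{C}_{j-1}$, namely the partition consisting
of every set $B \cap C$ where $B \in \mathcal{B}_{j-1}$ and $C \in \mathcal{C}_{j-1}$. Note that $|\mathcal{A}_j| \le (2^{2^{j-1}})^2 = 2^{2^j}$,
so that $(\mathcal{A}_j)$ is admissible. Define a sequence $(A_j)_{j \ge 0}$ of increasing subsets of $A$
by taking exactly one element in each set of $\mathcal{A}_j$. Such a set $A_j$ is then used as an
approximation of $A$, and is such that $|A_j| \le 2^{2^j}$. Define $\pi_j(\alpha)$ by the relation

$$A_j \cap \mathcal{A}_j(\alpha) = \{\pi_j(\alpha)\},$$

and take $\pi_0(\alpha) = \alpha_*$. In view of Lemma 1, we have with a probability larger than
$1 - 2\exp(-(x + 2^{j+1}))$: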



Zn{-Kj^i{a)) - Zn{TTj{a)) < Cod2{TTj{a),Trj^i{a))\/x + 2^+ 

+ {Co + 1) 



d^{TTj{a),TTj-i{a)){x + 2^+^) 



Now, for a fixed $\alpha \in A$, decompose the increment $Z_n(\alpha_*) - Z_n(\alpha)$ along the chain
$(\pi_j(\alpha))_{j \ge 0}$:

$$Z_n(\alpha_*) - Z_n(\alpha) = \sum_{j \ge 1} \big( Z_n(\pi_{j-1}(\alpha)) - Z_n(\pi_j(\alpha)) \big),$$

and note that the number of pairs $\{\pi_j(\alpha), \pi_{j-1}(\alpha)\}$ is at most $2^{2^j} \times 2^{2^{j-1}} \le 2^{2^{j+1}}$.
This gives, together with union bounds for each term of the chain:



sup(Z„(a*) - Zn{a)) < sup V (CqVx + 2^+1^2 (tTj (a), 7rj_i(a)) 

+ ^^^{x + 2-'"+i)do,(^,(a),^,_i(a))) 

A/ 71 / 



aeA j>i 
with a probability larger than $1 - 2\sum_{j \ge 1} 2^{2^{j+1}} \exp(-(x + 2^{j+1})) \ge 1 - L\exp(-x)$
(with $L \approx 0.773$). But, for any $j \ge 2$, $\pi_j(\alpha), \pi_{j-1}(\alpha) \in \mathcal{A}_{j-1}(\alpha) \subset B_{j-2}(\alpha)$, so
$d_\infty(\pi_j(\alpha), \pi_{j-1}(\alpha)) \le \Delta(B_{j-2}(\alpha), d_\infty)$ and $d_\infty(\pi_1(\alpha), \pi_0(\alpha)) \le \Delta(B_0(\alpha), d_\infty) =
\Delta(A, d_\infty)$. Doing the same for $d_2$, we obtain that, with probability $\ge 1 - L\exp(-x)$:

sup(Z„K)-Z„(a)) < 2Co(l + V^h2(A, da) + (1 + x)-fi{A, d^). (47) 

aeA V ^ 

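We can do the same job for $\sup_{\alpha \in A}(P - P_n)(\ell'_\alpha - \ell'_{\alpha_*})$. Note that

$$\ell'_\alpha(X, (Y_t)) - \ell'_{\alpha_*}(X, (Y_t)) = \int_0^1 (\alpha(t, X) - \alpha_*(t, X))(\alpha(t, X) + \alpha_*(t, X) - 2\alpha_0(t, X)) Y(t)\, dt,$$

so using Assumptions 1 and 2, we have $|\ell'_\alpha(X, (Y_t)) - \ell'_{\alpha_*}(X, (Y_t))| \le 2(b + \|\alpha_0\|_\infty)\|\alpha -
\alpha_*\|_\infty$ and

$$\mathbb{E}[(\ell'_\alpha(X, (Y_t)) - \ell'_{\alpha_*}(X, (Y_t)))^2] \le 4(b + \|\alpha_0\|_\infty)^2 \|\alpha - \alpha_*\|^2.$$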

Therefore, Bernstein's inequality (for the sum of i.i.d. random variables) entails
that

$$(P - P_n)(\ell'_\alpha - \ell'_{\alpha_*}) \le 2(b + \|\alpha_0\|_\infty) \Big( \frac{\|\alpha - \alpha_*\| \sqrt{2x}}{\sqrt{n}} + \frac{\|\alpha - \alpha_*\|_\infty\, x}{n} \Big)$$

holds with a probability larger than $1 - e^{-x}$. Then, we can apply again the generic
chaining argument to prove that with a probability larger than $1 - Le^{-x}$:



sup(F- P„)(i'„ -t^J < 4(6+ ||ao||oo)( 1= + • 

qsA V n J 

This concludes the proof of the Theorem. □ 



Proof of Theorem 4 

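Proof of Theorem 4. Recall that the linearized risk over $A(\Lambda)$ is given by

$$R(\theta) := \sum_{\lambda \in \Lambda} \theta_\lambda P(\ell_{\alpha_\lambda})$$

for $\theta \in \Theta$, where we recall that

$$\Theta := \Big\{\theta \in \mathbb{R}^M : \theta_\lambda \ge 0, \ \sum_{\lambda \in \Lambda} \theta_\lambda = 1\Big\},$$

and the linearized empirical risk is given by

$$R_n(\theta) = \sum_{\lambda \in \Lambda} \theta_\lambda P_n(\ell_{\alpha_\lambda}).$$

We recall that the mixing estimator $\hat\alpha$ is given by

$$\hat\alpha := \sum_{\lambda \in \Lambda} \hat\theta_\lambda \alpha_\lambda,$$

where the Gibbs weights $\hat\theta = (\hat\theta_\lambda)_{\lambda \in \Lambda} := (\theta(\alpha_\lambda))_{\lambda \in \Lambda}$ are given by (35) and are the
unique solution of the minimization problem (36). By convexity of the risk, we have
for any $\epsilon > 0$:

$$P(\ell_{\hat\alpha} - \ell_{\alpha_0}) \le (1 + \epsilon)(R_n(\hat\theta) - P_n(\ell_{\alpha_0})) + \mathcal{R}_n,$$

where we introduced the residual term

$$\mathcal{R}_n := R(\hat\theta) - P(\ell_{\alpha_0}) - (1 + \epsilon)(R_n(\hat\theta) - P_n(\ell_{\alpha_0}))
= \sum_{\lambda \in \Lambda} \hat\theta_\lambda \big( P(\ell_{\alpha_\lambda} - \ell_{\alpha_0}) - (1 + \epsilon) P_n(\ell_{\alpha_\lambda} - \ell_{\alpha_0}) \big).$$

Let $\hat\lambda$ be such that $\alpha_{\hat\lambda}$ is the empirical risk minimizer in $A(\Lambda)$, namely

$$P_n(\ell_{\alpha_{\hat\lambda}}) = \min_{\lambda \in \Lambda} P_n(\ell_{\alpha_\lambda}).$$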



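Since

$$\sum_{\lambda \in \Lambda} \hat\theta_\lambda \log\Big(\frac{\hat\theta_\lambda}{1/M}\Big) = K(\hat\theta, u) \ge 0,$$

where $K(\theta, u)$ denotes the Kullback-Leibler divergence between the weights $\theta$ and the
uniform weights $u := (1/M)_{\lambda \in \Lambda}$, we have

$$R_n(\hat\theta) \le R_n(\hat\theta) + \frac{T}{n} K(\hat\theta, u)
= R_n(\hat\theta) + \frac{T}{n} \sum_{\lambda \in \Lambda} \hat\theta_\lambda \log \hat\theta_\lambda + \frac{T \log M}{n}
\le R_n(e_{\hat\lambda}) + \frac{T \log M}{n}
= P_n(\ell_{\alpha_{\hat\lambda}}) + \frac{T \log M}{n},$$

where $e_{\hat\lambda} \in \Theta$ is the vector with all its coordinates equal to 0 except for the $\hat\lambda$-th,
which is equal to 1. This gives

$$P(\ell_{\hat\alpha} - \ell_{\alpha_0}) \le (1 + \epsilon) \min_{\lambda \in \Lambda} P_n(\ell_{\alpha_\lambda} - \ell_{\alpha_0}) + (1 + \epsilon)\frac{T \log M}{n} + \mathcal{R}_n,$$

and consequently

$$\mathbb{E}^n\|\hat\alpha - \alpha_0\|^2 \le (1 + \epsilon) \min_{\lambda \in \Lambda} \|\alpha_\lambda - \alpha_0\|^2 + (1 + \epsilon)\frac{T \log M}{n} + \mathbb{E}^n[\mathcal{R}_n].$$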

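Hence, it remains to prove that for some constant $C = C_{\epsilon, \|\alpha_0\|_\infty, b}$ we have

$$\mathbb{E}^n[\mathcal{R}_n] \le C\,\frac{\log M}{n}. \qquad (48)$$

Since $R(\cdot)$ and $R_n(\cdot)$ are linear on $\Theta$, we have

$$\mathcal{R}_n \le \max_{\alpha \in A(\Lambda)} \Big( (1 + \epsilon)\big(P(\ell_\alpha - \ell_{\alpha_0}) - P_n(\ell_\alpha - \ell_{\alpha_0})\big) - \epsilon P(\ell_\alpha - \ell_{\alpha_0}) \Big).$$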
The following decomposition holds (see Section 3.4):

$$(P - P_n)(\ell_\alpha - \ell_{\alpha_0}) = (P - P_n)(\ell'_\alpha - \ell'_{\alpha_0}) + \frac{2}{\sqrt{n}} Z_n(\alpha_0 - \alpha).$$

Bernstein's inequality for the sum of i.i.d. variables (see the proof of Theorem 1)
gives

$$(P - P_n)(\ell'_\alpha - \ell'_{\alpha_0}) \le (b + \|\alpha_0\|_\infty) \Big( \frac{\|\alpha - \alpha_0\| \sqrt{2x}}{\sqrt{n}} + \frac{\|\alpha - \alpha_0\|_\infty\, x}{n} \Big),$$

so together with Lemma 1, and since $P(\ell_\alpha - \ell_{\alpha_0}) = \|\alpha - \alpha_0\|^2$, we obtain that



C|| II .J2xPaa-£ao) Cl, II (,a 

(P - Pn){ia ~ iao) < """"--""^ ^ — + ^^^^ 



with probability larger than 1 — 3e ^, where CjaoW b ^llaoH^ / V2 + b+ ||ao||oo and 

^IKIIoc.b (*^l|aolU + 1 + &+ \\a\\oo){b+ llaolloo), with C\ao\\^ given in Lemma 1. 
Now, using the fact that 



Vn 1 + e e n 

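we obtain that with a probability larger than $1 - 3e^{-x}$:

$$(1 + \epsilon)\big(P(\ell_\alpha - \ell_{\alpha_0}) - P_n(\ell_\alpha - \ell_{\alpha_0})\big) - \epsilon P(\ell_\alpha - \ell_{\alpha_0}) \le C_{\epsilon, \|\alpha_0\|_\infty, b}\, \frac{x}{n},$$

where $C_{\epsilon, \|\alpha_0\|_\infty, b} := (C_{\|\alpha_0\|_\infty, b})^2 (1 + \epsilon)^2/\epsilon + (1 + \epsilon) C'_{\|\alpha_0\|_\infty, b}$. This subexponential
deviation entails that for any $x > 0$: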

E" 7e„ <2x+ ^ 

■' n 

where $C = C_{\epsilon, \|\alpha_0\|_\infty, b}$. If we denote by $x(y)$ the unique solution of $x = y\exp(-x)$,
where $y > 0$, we obtain (48)
for the choice $x = C x(M)/n$, since we have $x(M) \le \log M$. This concludes the proof
of Theorem 4. □

Proof of Theorem 6. Assume for now that (38) holds. Take $v_\Delta \in S_\Delta^{d-1}$ such that
$|v_\Delta - v_0|_2 \le \Delta$, and let $m_* = (m_1^*, m_2^*)$ be the oracle dimension of the sieve for the
link function, that satisfies (37) with $d = 1$. Denote for short the oracle estimator

$$\bar\alpha_* := \bar\alpha_{m_*}^{(v_\Delta)},$$

that is, the element of $A_{m_*}^{(1)}$ that minimizes the empirical risk computed using the
training sample $D_{n,1}(v_\Delta)$.

Note that the cardinality of $S_\Delta^{d-1}$ is smaller than $c/\Delta^{d-1}$, where $\Delta = (n \log n)^{-1/2}$,
so the cardinality of the whole dictionary $\{\bar\alpha_m : m \in M_n\} \cup \{\bar\alpha_m^{(v)} : m \in M_n^{(1)}, v \in
S_\Delta^{d-1}\}$ is of order $c\, n^{(d-1)/2} (\log n)^{(d+3)/2} + (\log n)^{d+1}$. As a consequence, Theorem 4
gives



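$$\mathbb{E}^{2n}\|\hat\alpha_n - \alpha_0\|^2 \le 2\,\mathbb{E}^n\|\bar\alpha_* - \alpha_0\|^2 + c\,\frac{\log n}{n}.$$

Note that (41) entails $\|\beta_0(\cdot, v_\Delta^\top \cdot) - \beta_0(\cdot, v_0^\top \cdot)\|^2 \le c\Delta^2 = c/(n \log n)$. Hence,

$$\|\bar\alpha_* - \alpha_0\|^2 \le 2\|\bar\alpha_* - \beta_0(\cdot, v_\Delta^\top \cdot)\|^2 + \frac{2c}{n \log n}.$$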
We shall denote in what follows by $\mathbb{E}_v^n$ the expectation with respect to $P_v^n$, the joint law of the
observations when the intensity writes $\beta_0(\cdot, v^\top \cdot)$ (the true index is $v$). For two indexes
$v, v_0 \in S_+^{d-1}$, we introduce the following likelihood ratio:

$$L_n(v_0, v) := \frac{dP^n_{\beta_0(\cdot, v_0^\top \cdot)}}{dP^n_{\beta_0(\cdot, v^\top \cdot)}},$$

which is the likelihood ratio of the training data $D_{n,1}$ "between" the two indexes
$v$ and $v_0$. It can be explicitly computed using Jacod's formula, see Appendix A.2
below. Of course, when $v$ and $v_0$ are close to each other, we expect $L_n(v_0, v)$ to be
small. This is the statement of the next Lemma.

Lemma 4. Grant Assumption 5, and let $v, v_0 \in S_+^{d-1}$ be such that $\|v - v_0\|_2 \le \Delta_n$,
where $\Delta_n = (n \log n)^{-1/2}$. Then, if $n$ is large enough, one has for any $x > 0$:

where $c = b_0/(2 d c_0^2)$.

The proof of this Lemma can be found below. It uses the same kind of arguments
as the proof of Proposition 1. Let $x > 0$ be chosen later on, and decompose the
expectation over $\{L_n(v_0, v_\Delta) \ge x\}$ and $\{L_n(v_0, v_\Delta) < x\}$ to get


«*-/?o(-,<-)lr = E" [||a*-/3o(-,f. 



a:L„{vo,VA)] 



+ E;y||a* -/3o(-,i'A-)ll 1l„(.„,.^)>J 
so using Assumption 3 and Lemma 4, we obtain 

so for X = e^/ we have 

E:ja.-(3o{-,vl-)r<cE:ja,-(3o{-,vl-)r + ^. 

But, E"^ /3o(-, wJOlP is nothing but the risk of the minimizer (3m* of the empirical 
risk R^^f^ over the sieve Am*- in this risk, the "true covariate" is now v^X. Indeed, 

T M|2 



r-l 



(/3™. {t, v^x) - (3o{t, v^x)yE[Y{t)\X = x]dtPx{dx) 



CPm*{t.x') - Po{t,x')fE[Y{t)\vlX = x']dtP^Tx{dx') 
so conducting the same analysis as in Section 5.1, we can prove that the choice of m*
entails that 

This concludes the proof of Theorem 6 in the single-index case. If (38) does not hold, 
then in the oracle inequality we take the oracle purely nonparametric element, using 
the same analysis as in Section 5.1. □ 

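Proof of Lemma 4. In view of Equation (54), see Appendix A.2, we can write, using (2):

$$\log L_n(v_0, v) = \sum_{i=1}^n \Big( \int_0^1 C_{v_0,v}(t, X_i) \, dN^i(t) - \int_0^1 T_{v_0,v}(t, X_i) Y^i(t) \, dt \Big)$$

$$= \sum_{i=1}^n \int_0^1 C_{v_0,v}(t, X_i) \, dM^i(t)
+ \sum_{i=1}^n \int_0^1 \big[ C_{v_0,v}(t, X_i) \beta_0(t, v^\top X_i) - T_{v_0,v}(t, X_i) \big] Y^i(t) \, dt,$$

where we shall use the notations

$$T_{v_0,v}(t, X_i) := \beta_0(t, v_0^\top X_i) - \beta_0(t, v^\top X_i),$$
$$C_{v_0,v}(t, X_i) := \log \beta_0(t, v_0^\top X_i) - \log \beta_0(t, v^\top X_i)$$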

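throughout the proof of the Lemma. Now, fix some $h > 0$ (to be chosen later on) and
write

$$P_v^n[L_n(v_0, v) \ge x] \le E_v^n[L_n(v_0, v)^h] \, e^{-h \log x}$$

$$= E_v^n \Big[ \exp\Big( h \sum_{i=1}^n \int_0^1 C_{v_0,v}(t, X_i) \, dM^i(t)
+ h \sum_{i=1}^n \int_0^1 \big[ C_{v_0,v}(t, X_i) \beta_0(t, v^\top X_i) - T_{v_0,v}(t, X_i) \big] Y^i(t) \, dt - h \log x \Big) \Big].$$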

We follow the main steps of the proof of Proposition 1. Define

$$U_h^i(t) := h \int_0^t C_{v_0,v}(s, X_i) \, dM^i(s) - S_h^i(t) := h O^i(t) - S_h^i(t),$$

where $S_h^i(t)$ is the compensator of

$$\frac{1}{2} h^2 \langle O^{i,c} \rangle(t) + \sum_{s \le t} \big( \exp(h |\Delta O^i(s)|) - 1 - h |\Delta O^i(s)| \big), \quad (49)$$

where $O^{i,c}$ is the continuous part of the process $O^i$. We know from the proof of
Lemma 2.2 and Corollary 2.3 of van de Geer (1995), see also Liptser and Shiryayev
(1989), that $\exp(U_h^i) = \exp(h O^i - S_h^i)$, for $i = 1, \ldots, n$, are i.i.d. super-martingales.
As a consequence, we get:

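$$P_v^n[L_n(v_0, v) \ge x] \le \Big( E_v^n \Big[ \exp\Big( 2 \sum_{i=1}^n U_h^i(1) \Big) \Big] \Big)^{1/2} \big( E_v^n[L_n(1)] \big)^{1/2} \le \big( E_v^n[L_n(1)] \big)^{1/2},$$

where

$$L_n(1) := \exp\Big( 2 \sum_{i=1}^n \Big\{ S_h^i(1) + h \int_0^1 \big\{ C_{v_0,v}(t, X_i) \beta_0(t, v^\top X_i) - T_{v_0,v}(t, X_i) \big\} Y^i(t) \, dt \Big\} - 2 h \log x \Big).$$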

We now establish an upper bound for $E_v^n[L_n(1)]$. Looking more closely at the process
$S_h^i(t)$, we can write:

rt 



k>2 ^0 



where the processes $V_k^i$ have been defined in the proof of Proposition 1. Assumption 5
and the fact that $\|v - v_0\|_2 \le \Delta$ give

$$|T_{v_0,v}(t, x)| \le c_0 \sqrt{d} \, \Delta =: \varepsilon \quad (50)$$

for any $t \ge 0$ and $x \in [0,1]^d$. In particular, we have $|T_{v_0,v}(t, x)| \le b_0/2$ when $n$ is
large enough. This allows us to write:

\C,,At,X,)\ < *i/^„(t,„T^^)(T„„,„(i,X,)) X (l//3o(t,«c[X0) 

where $\Psi_a(x) := -\log(1 - ax)/a$ for $a > 0$ and $x < 1/a$. Since $\Psi_a(x) = -\log(1 - ax)/a \le x + a x^2$
for any $x \in [0, 1/(2a)]$, we obtain

\r (i Y\\^ |T„o,t,(i,Xi)| / \Ty^^^y{t,Xi)\ 
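
The elementary inequality used in this step, namely $-\log(1 - ax)/a \le x + a x^2$ for $x \in [0, 1/(2a)]$, can be checked numerically. The following is only a sanity check on a grid, not a proof; the function names `psi` and `max_violation` are ours and do not appear in the paper.

```python
import math


def psi(a: float, x: float) -> float:
    """Psi_a(x) = -log(1 - a*x) / a, defined for a > 0 and x < 1/a."""
    return -math.log(1.0 - a * x) / a


def max_violation(a: float, grid_size: int = 1000) -> float:
    """Largest value of Psi_a(x) - (x + a*x^2) on a regular grid of [0, 1/(2a)].

    A non-positive result means the bound holds at every grid point.
    """
    upper = 1.0 / (2.0 * a)
    worst = -math.inf
    for k in range(grid_size + 1):
        x = upper * k / grid_size
        worst = max(worst, psi(a, x) - (x + a * x * x))
    return worst


# The bound should hold whatever the value of a > 0.
for a in (0.1, 1.0, 5.0, 50.0):
    print(a, max_violation(a) <= 1e-12)
```

Equality holds at $x = 0$, and the gap grows with $x$, which is why the restriction to $[0, 1/(2a)]$ is needed.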



We can write, as a consequence: 



<(t-)(1 ' 



k>2 ■ "^0 

< /'V.o.«(^,^.)l'/3o(s,fo^W^(.)d.x5]^(^)'"'(l + ^''"' 



< ^ |£,„,,(s,X,)P/3o(s,i;c[^.)>^Xs)c?s X y(l + C;,), 

where
Note that $C_h < 1$ for $h\varepsilon$ and $\varepsilon$ small enough. We obtain

r-l 



E:;„[L„(1)] < E,„[exp(n{ \C^,,,{t, X)f f]o{t,v^ X)Y{t)dt x h\l + c^) 

— 2/ilogx J 



Using again the above trick involving the function $\Psi_a$, we obtain:



{t, Xi)rjo{t,v^ X,) - T,„^,{t, X,) < 



< 



P^{t,v^X,) - bo' 



Using the fact that $\log(x/y)^2 x \le 2\varepsilon^2/(x \wedge y)$ for any $x, y > 0$ such that $|x - y| \le
\varepsilon \le (x \wedge y)/2$ and $\varepsilon > 0$ small enough [decompose over $\{x < y\}$ and $\{x \ge y\}$ and use
again the previous upper bound on $\Psi_a(x)$], we have in view of (50):

Cy„^y{t,x)'^P0{t,VQX) < — 

for any $t \ge 0$ and $x \in [0,1]^d$ and $\varepsilon$ small enough. Finally, we get, using the fact that
$Y \le 1$,



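$$E_v^n[L_n(1)] \le E_v^n \Big[ \exp\Big( n \int_0^1 \Big\{ \frac{2 \varepsilon^2 h^2}{b_0} + \frac{2 h \varepsilon^2}{b_0} \Big\} Y(t) \, dt - 2 h \log x \Big) \Big]
\le \exp\Big( \frac{2 n \varepsilon^2 h}{b_0} (1 + h) - 2 h \log x \Big)$$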

for any $h > 0$, so for the choice $h = b_0 \log x/(2 n \varepsilon^2)$, we obtain

PllLnivo, v)>x]< exp ( - ^^^^^ 
and the conclusion follows, since $\Delta = 1/\sqrt{n \log n}$ and $n \varepsilon^2 = d c_0^2/\log n$.



□ 



A Appendix 

A.1 Some tools from approximation theory

Let us give two examples of sieves that are spanned by localized bases. In each case,
we give the control on $r(A)$ and we give a standard but useful approximation result
below. Note that other examples of sieves are available, see Barron et al. (1999) for
instance.

A.1.1 Piecewise polynomials

Fix $l_1, \ldots, l_{d+1} \in \mathbb{N}$ and $m_1, \ldots, m_{d+1} \in \mathbb{N}$, and define the set $\mathcal{R}$ of rectangles
$\prod_{i=1}^{d+1} [(j_i - 1) 2^{-m_i}, j_i 2^{-m_i}[$ for $0 < j_i \le 2^{m_i}$, $i = 1, \ldots, d+1$. So, $\mathcal{R}$ is a regular
partition of $[0,1]^{d+1}$. Take $m = (m_1, \ldots, m_{d+1})$ and define $A_m$ as the set of functions
$f : [0,1]^{d+1} \to \mathbb{R}$ such that for any $R \in \mathcal{R}$ the restriction of $f$ to $R$ coincides with
a polynomial of degree not larger than $l_i$ in the $i$th coordinate, for $i = 1, \ldots, d+1$.
The dimension of $A_m$ is then

$$\prod_{i=1}^{d+1} (l_i + 1) 2^{m_i},$$

and using Barron et al. (1999), see Section 3.2.1, we have, since $\mathcal{R}$ is a regular
partition,

where $c_{l_1, \ldots, l_{d+1}} = \big( \prod_{i=1}^{d+1} (l_i + 1)(2 l_i + 1) \big)^{1/2}$.
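
As a small illustration of how the size of the piecewise polynomial sieve grows with the resolution levels, the sketch below counts the rectangles of the regular partition and the resulting dimension, with $(l_i + 1)$ free coefficients per coordinate on each rectangle. The function names are ours and are only meant for illustration.

```python
from math import prod


def partition_size(levels):
    """Number of rectangles of the regular partition: prod_i 2**m_i."""
    return prod(2 ** m for m in levels)


def sieve_dimension(degrees, levels):
    """Dimension of the piecewise polynomial sieve: prod_i (l_i + 1) * 2**m_i.

    degrees -- maximal polynomial degree l_i in each coordinate
    levels  -- resolution level m_i in each coordinate
    """
    if len(degrees) != len(levels):
        raise ValueError("degrees and levels must have the same length")
    return prod((l + 1) * 2 ** m for l, m in zip(degrees, levels))


# Two coordinates: degree 1 at level 3, degree 2 at level 2.
print(partition_size((3, 2)))           # 8 * 4 rectangles
print(sieve_dimension((1, 2), (3, 2)))  # (2 * 8) * (3 * 4) coefficients
```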
A.1.2 Wavelets

Consider a pair $\{\phi, \psi\}$ of scaling function and wavelet, where $\psi$ has $K$ vanishing
moments. Then $\phi$ and $\psi$ have a support width of at least $2K - 1$, and there is a
pair with minimal support, see Daubechies (1988). This is the starting point of the
construction of an orthonormal wavelet basis of $L^2[0,1]$, as proposed in Cohen et al.
(1993). Roughly, the idea is to retain the interior scaling functions (those "far" from
the edges 0 and 1), and to add adapted edge scaling functions, see Section 4 and
Theorem 4.4 in Cohen et al. (1993). This construction keeps the orthonormality
of the system and the number of vanishing moments unchanged, as well as the
number $2^j$ of scaling functions at each resolution $j$. More precisely, if $l$ is such that
$2^l \ge 2K$, consider for $j \ge l - 1$:





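$$\Psi_{j,k} := \begin{cases}
\psi^0_{j,k} & \text{if } j \ge l \text{ and } k = 0, \ldots, K - 1 \\
\psi_{j,k} & \text{if } j \ge l \text{ and } k = K, \ldots, 2^j - K - 1 \\
\psi^1_{j,k} & \text{if } j \ge l \text{ and } k = 2^j - K, \ldots, 2^j - 1 \\
\phi^0_{l,k} & \text{if } j = l - 1 \text{ and } k = 0, \ldots, K - 1 \\
\phi_{l,k} & \text{if } j = l - 1 \text{ and } k = K, \ldots, 2^l - K - 1 \\
\phi^1_{l,k} & \text{if } j = l - 1 \text{ and } k = 2^l - K, \ldots, 2^l - 1
\end{cases}$$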



where $\phi_{j,k} = 2^{j/2} \phi(2^j \cdot - k)$ and $\psi_{j,k} = 2^{j/2} \psi(2^j \cdot - k)$ are the "interior" dilatations and
translations of $\phi$ and $\psi$, and $\phi^0_{j,k}$, $\psi^0_{j,k}$, $\phi^1_{j,k}$, $\psi^1_{j,k}$ are, at each resolution $j$, dilatations of
$2K$ edge scaling functions and wavelets ($K$ for each edge). We know from Cohen et al.
(1993) that the collection

$$W := \{\Psi_{j,k} : j \ge l - 1, \; k = 0, \ldots, 2^j - 1\}$$

is an orthonormal basis of $L^2[0,1]$, and the interior and edge wavelets have $K$ vanishing
moments. Let $W^{(i)}$, $i = 1, \ldots, d+1$, be several collections $W$ based on pairs
$\{\phi^{(i)}, \psi^{(i)}\}$ (possibly with different numbers of vanishing moments). Then, the
collection

$$\Big\{ \bigotimes_{i=1}^{d+1} \Psi^{(i)}_{j_i,k_i} : j_i \ge l_i - 1, \; k_i = 0, \ldots, 2^{j_i} - 1, \; i = 1, \ldots, d+1 \Big\},$$

where $\bigotimes_{i=1}^{d+1} \Psi^{(i)}_{j_i,k_i}(x_1, \ldots, x_{d+1}) = \prod_{i=1}^{d+1} \Psi^{(i)}_{j_i,k_i}(x_i)$, is an orthonormal basis of $L^2[0,1]^{d+1}$
that has suitable approximation properties for a function with an anisotropic
smoothness, see below. Let $m = (m_1, \ldots, m_{d+1}) \in \mathbb{N}^{d+1}$ be fixed, where $m_i \ge l_i$ for any
$i \in \{1, \ldots, d+1\}$, and define the sieve

$$A_m := \mathrm{span}\{\Psi_\lambda : \lambda \in \Lambda(m)\}, \quad (51)$$

where for $\lambda = (j_1, k_1, \ldots, j_{d+1}, k_{d+1})$,

$$\Psi_\lambda := \bigotimes_{i=1}^{d+1} \Psi^{(i)}_{j_i,k_i},$$

and where

$$\Lambda(m) = \{(j_1, k_1, \ldots, j_{d+1}, k_{d+1}) : l_i - 1 \le j_i \le m_i, \; k_i = 0, \ldots, 2^{j_i} - 1, \; i = 1, \ldots, d+1\}.$$

The dimension of $A_m$ is $\prod_{i=1}^{d+1} D_{m_i}$, where $D_{m_i} = 2^{m_i} - 2^{l_i} \le 2^{m_i}$. The control of
$r(A_m)$ easily follows from the fact that if the resolution levels $j_i \ge l_i$ are fixed for any
$i = 1, \ldots, d+1$, the tensor products have disjoint supports, except for a finite
number of indexes $k_i$ that depends only on the support of the scaling and mother
wavelet functions used in the construction of $W$. As a consequence, we have

$$r(A_m) \le \frac{1}{\sqrt{D_m}} \sup_{\beta \ne 0} \frac{\| \sum_{\lambda \in \Lambda(m)} \beta_\lambda \Psi_\lambda \|_\infty}{|\beta|_\infty} \le c^*,$$

where $D_m = \prod_{i=1}^{d+1} D_{m_i}$, $|\beta|_\infty = \sup_{\lambda \in \Lambda(m)} |\beta_\lambda|$ and where $c^*$ is a constant that
depends only on the scaling and mother wavelet functions used in the construction of
the basis, and not on the resolution level $m$.

In the next section, we give the definition of the anisotropic Besov space, and 
recall a useful approximation result. The definitions and results presented here can 
be found in Triebel (2006), in particular in Chapter 5 which is about anisotropic 
spaces. 

A.1.3 Anisotropic Besov space, approximation

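Let $\{e_1, \ldots, e_{d+1}\}$ be the canonical basis of $\mathbb{R}^{d+1}$ and $s = (s_1, \ldots, s_{d+1})$ with $s_i > 0$
be a vector of directional smoothness, where $s_i$ corresponds to the smoothness in
direction $e_i$. If $k \in \mathbb{N}$ and $e \in \mathbb{R}^{d+1}$, define

$$\Omega_e^k := \{x \in \mathbb{R}^{d+1} : x + j e \in [0,1]^{d+1} \text{ for } j = 0, \ldots, k\}.$$

If $f : [0,1]^{d+1} \to \mathbb{R}$, we define $\Delta_e^k f$ as the difference of order $k \ge 1$ and step
$e \in [0,1]^{d+1}$, given by $\Delta_e^1 f(x) = f(x + e) - f(x)$ and $\Delta_e^k f(x) = \Delta_e^1(\Delta_e^{k-1} f)(x)$
for any $x \in \Omega_e^k$. We say that $f \in L^2[0,1]^{d+1}$ belongs to the anisotropic Besov space
$B^s_{2,\infty}([0,1]^{d+1})$ if the semi-norm

$$|f|_{B^s_{2,\infty}} := \sup_{t > 0} \Big( \sum_{i=1}^{d+1} t^{-s_i} \sup_{h : |h| \le t} \Big( \int_{\Omega^{k_i}_{h e_i}} \big( \Delta^{k_i}_{h e_i} f(x) \big)^2 dx \Big)^{1/2} \Big) \quad (52)$$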



is finite. We know that the norms

$$\|f\|_{B^s_{2,\infty}} := \|f\|_2 + |f|_{B^s_{2,\infty}}$$

are equivalent for any choice of $k_i > s_i$. Note that if $s = (s, \ldots, s)$ for some $s > 0$,
then $B^s_{2,\infty}$ is the standard isotropic Besov space. Moreover, the embedding $B^s_{2,2} \subset
B^s_{2,\infty}$ holds. When $s = (s_1, \ldots, s_{d+1})$ has integer coordinates, $B^s_{2,2}$ is the anisotropic
Sobolev space

$$B^s_{2,2} = W^s_2 = \Big\{ f \in L^2 : \sum_{i=1}^{d+1} \Big\| \frac{\partial^{s_i} f}{\partial x_i^{s_i}} \Big\|_2 < \infty \Big\}.$$

If $s$ has non-integer coordinates, then $B^s_{2,2}$ is the anisotropic Bessel-potential space

$$\Big\{ f \in L^2 : \sum_{i=1}^{d+1} \big\| (1 + |\xi_i|^2)^{s_i/2} \hat{f}(\xi) \big\|_2 < \infty \Big\},$$



where $\hat{f}$ is the Fourier transform of $f$. If $\alpha_0 \in B^s_{2,\infty}$, we can give a control on the
approximation term $\inf_{\alpha \in A} \|\alpha - \alpha_0\|$, when $A$ is spanned by piecewise polynomials or
wavelets (see above). Indeed, the next Lemma is a direct consequence of the Jackson
estimate given in Hochmuth (2002), together with definition (52) of the Besov space.
Note that this Lemma can also be found in Comte et al. (2008) and Lacour (2007).
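
The iterated differences that enter definition (52) can be computed directly from their recursive definition. The sketch below does so and compares the result with the classical binomial expansion $\sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} f(x + j e)$; the function names and the test function are ours, chosen only for illustration.

```python
from math import comb


def finite_difference(f, x, step, order):
    """Iterated difference: Delta^1 f(x) = f(x + e) - f(x),
    Delta^k f(x) = Delta^1(Delta^{k-1} f)(x). Points are tuples of floats."""
    if order == 0:
        return f(x)
    shifted = tuple(xi + ei for xi, ei in zip(x, step))
    return (finite_difference(f, shifted, step, order - 1)
            - finite_difference(f, x, step, order - 1))


def finite_difference_binomial(f, x, step, order):
    """Same quantity through the binomial expansion of the difference operator."""
    total = 0.0
    for j in range(order + 1):
        point = tuple(xi + j * ei for xi, ei in zip(x, step))
        total += (-1) ** (order - j) * comb(order, j) * f(point)
    return total


def example(x):
    # Quadratic in the first coordinate, linear in the second.
    return x[0] ** 2 + x[1]


x, e = (0.2, 0.3), (0.1, 0.0)  # step h * e_1 with h = 0.1
print(finite_difference(example, x, e, 2))  # close to 2 * h**2 = 0.02
print(finite_difference(example, x, e, 3))  # close to 0: order exceeds the degree
```

A difference of order $k$ annihilates polynomials of degree smaller than $k$ in the direction of the step, which is the reason why the order $k_i$ has to exceed the smoothness $s_i$ in (52).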

Lemma 5. Assume that $\alpha_0 \in B^s_{2,\infty}$ where $s = (s_1, \ldots, s_{d+1})$ and let $l_i > s_i$ for
$i = 1, \ldots, d+1$. Let $A_m$ be either:

• the piecewise polynomial sieve (see Section A.1.1) with degrees $l_i$ in the $i$th
coordinate, based on a partition with rectangles of sidelengths $2^{-m_i}$, or

• the wavelet sieve (see Section A.1.2), where the wavelets have $l_i$ vanishing
moments in the $i$th coordinate.

Then, there is a constant $c = c_{s,d} > 0$ such that

$$\inf_{\alpha \in A_m} \|\alpha - \alpha_0\|_2 \le c \, |\alpha_0|_{B^s_{2,\infty}} \sum_{i=1}^{d+1} 2^{-m_i s_i}.$$
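
To see what the anisotropic form of this bound buys, one can evaluate the factor $\sum_i 2^{-m_i s_i}$ for several choices of resolution levels. The sketch below is only illustrative: the constant $c$ and the semi-norm are set to 1 by default, since Lemma 5 does not give their values, and the function name is ours.

```python
def approximation_bound(levels, smoothness, c=1.0, seminorm=1.0):
    """Right-hand side of Lemma 5: c * |alpha_0| * sum_i 2**(-m_i * s_i).

    levels     -- resolution levels m_i
    smoothness -- directional smoothness s_i
    c, seminorm are placeholders (assumed equal to 1 here).
    """
    if len(levels) != len(smoothness):
        raise ValueError("levels and smoothness must have the same length")
    return c * seminorm * sum(2.0 ** (-m * s) for m, s in zip(levels, smoothness))


# Same total budget m_1 + m_2 = 6 (sieves of comparable dimension),
# with smoothness 1 in the first direction and 2 in the second one.
smooth = (1.0, 2.0)
print(approximation_bound((3, 3), smooth))  # isotropic levels
print(approximation_bound((4, 2), smooth))  # more resolution where the function is rougher
```

With the same budget, spending more resolution in the rougher direction balances the two terms of the sum and gives the smaller bound, which is the mechanism behind adaptation to anisotropic smoothness.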



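A.2 Some tools from the theory of counting processes and
stochastic calculus

Let $P_{\alpha_0}$ be the joint law of $\{(X, N(t), Y(t)) : t \in [0,1]\}$ when (2) holds (the intensity
is $\alpha_0$). We want to explain why the log-likelihood ratio $\ell(\alpha, \alpha_0) := \log(dP_\alpha / dP_{\alpha_0})$
writes, when both $\alpha$ and $\alpha_0$ are assumed to be positive on $[0,1]^{d+1}$:

$$\ell(\alpha, \alpha_0) = \int_0^1 \log\Big( \frac{\alpha(t, X)}{\alpha_0(t, X)} \Big) dN(t) - \int_0^1 \big( \alpha(t, X) - \alpha_0(t, X) \big) Y(t) \, dt. \quad (53)$$

This will entail that the log-likelihood ratio $\ell_n(\alpha, \alpha_0) := \log(dP^n_\alpha / dP^n_{\alpha_0})$ of the
independent sample (3) satisfies

$$\ell_n(\alpha, \alpha_0) = \sum_{i=1}^n \Big( \int_0^1 \log\Big( \frac{\alpha(t, X_i)}{\alpha_0(t, X_i)} \Big) dN^i(t) - \int_0^1 \big( \alpha(t, X_i) - \alpha_0(t, X_i) \big) Y^i(t) \, dt \Big). \quad (54)$$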
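
For concreteness, here is a minimal numerical sketch of formulas (53) and (54). It assumes that each subject is described by a covariate, the list of jump times of its counting process and an at-risk function $t \mapsto Y(t)$; the time integral is approximated with a midpoint rule. All names and the data layout are ours.

```python
import math


def log_likelihood_ratio(alpha, alpha0, x, jump_times, at_risk, grid_size=2000):
    """Formula (53) for one subject observed on [0, 1].

    alpha, alpha0 -- positive intensities, called as alpha(t, x)
    jump_times    -- times at which the counting process N jumps
    at_risk       -- function t -> Y(t), with values in [0, 1]
    """
    # Integral with respect to dN: a finite sum over the jumps of N.
    jump_part = sum(math.log(alpha(t, x) / alpha0(t, x)) for t in jump_times)
    # Integral with respect to dt: midpoint rule on a regular grid.
    h = 1.0 / grid_size
    integral = 0.0
    for k in range(grid_size):
        t = (k + 0.5) * h
        integral += (alpha(t, x) - alpha0(t, x)) * at_risk(t) * h
    return jump_part - integral


def sample_log_likelihood_ratio(alpha, alpha0, sample, grid_size=2000):
    """Formula (54): sum of (53) over independent subjects.

    sample -- iterable of triples (x, jump_times, at_risk)
    """
    return sum(log_likelihood_ratio(alpha, alpha0, x, jumps, y, grid_size)
               for x, jumps, y in sample)


# Toy example: constant intensities 2 and 1, two jumps, subject at risk until t = 0.8.
two = lambda t, x: 2.0
one = lambda t, x: 1.0
y = lambda t: 1.0 if t <= 0.8 else 0.0
print(log_likelihood_ratio(two, one, 0.3, [0.2, 0.7], y))  # close to 2*log(2) - 0.8
```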

Equation (54) is useful in several parts of the paper (dimension reduction and lower
bounds).
First, we recall Jacod's formula (see Andersen et al. (1993)) for the likelihood of
a counting process. It writes, for the likelihood of $N$:

$$\prod_{t \in [0,1]} \Big\{ \big( \alpha_0(t, X) Y(t) \big)^{\Delta N(t)} \big( 1 - \alpha_0(t, X) Y(t) \, dt \big)^{1 - \Delta N(t)} \Big\},$$

where $\Delta N(t) = N(t) - N(t^-)$ and where $\prod$ is the product-integral, see Andersen et al.
(1993) for a definition. But $N$ has a finite number of jumps on $[0,1]$ and $\Delta N(t) \in
\{0, 1\}$ for any $t \in [0,1]$, thus $1 - \Delta N(t) = 1$ for any $t \in [0,1]$ except at a finite number
of times. Consequently the likelihood of $N$ reduces to

$$\prod_{t \in [0,1]} \big( \alpha_0(t, X) Y(t) \big)^{\Delta N(t)} \exp\Big( - \int_0^1 \alpha_0(t, X) Y(t) \, dt \Big),$$

where the first product is actually finite, and where we used the fact that
$\prod_{t \in [0,1]} (1 - f(t) \, dt) = \exp(- \int_0^1 f(t) \, dt)$ for a continuous function $f$ on $[0,1]$. Thus, the likelihood
ratio $L(\alpha, \alpha_0) = dP_\alpha / dP_{\alpha_0}$ writes

$$L(\alpha, \alpha_0) = \prod_{t \in [0,1]} \Big( \frac{\alpha(t, X)}{\alpha_0(t, X)} \Big)^{\Delta N(t)} \exp\Big( - \int_0^1 \big( \alpha(t, X) - \alpha_0(t, X) \big) Y(t) \, dt \Big),$$

which entails (53) since $\sum_{t \in [0,1]} \varphi(t) \Delta N(t) = \int_0^1 \varphi(t) \, dN(t)$. Equation (54) is a
consequence of (53), together with the fact that since $N^1, \ldots, N^n$ are independent, they
cannot jump at the same time, so that $\Delta N^i(t) \in \{0, 1\}$ a.s.
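
The product-integral identity used above, $\prod_{t \in [0,1]} (1 - f(t) \, dt) = \exp(-\int_0^1 f(t) \, dt)$, can be illustrated by replacing the product-integral with a finite product over a fine grid. The sketch below is a numerical illustration under that discretization, with a test function of our choosing.

```python
import math


def discretized_product_integral(f, n_steps):
    """Finite product prod_k (1 - f(t_k) * h) over midpoints t_k of a grid of [0, 1]."""
    h = 1.0 / n_steps
    result = 1.0
    for k in range(n_steps):
        result *= 1.0 - f((k + 0.5) * h) * h
    return result


def f(t):
    # Continuous test function on [0, 1]; its integral over [0, 1] equals 1.5.
    return 1.0 + t


limit = math.exp(-1.5)
for n_steps in (10, 100, 1000, 10000):
    # The gap to exp(-1.5) shrinks roughly like 1 / n_steps.
    print(n_steps, abs(discretized_product_integral(f, n_steps) - limit))
```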
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